Re: Who is using the Maven uimaFIT plugin in open source?
Hi! On Thu, Feb 04, 2016 at 10:11:00AM +0100, Richard Eckart de Castilho wrote: > I am looking for open source projects or at least publicly > distributed components that are using UIMA in conjunction > with Maven and with the uimaFIT Maven plugin. > > If you know or have such a project, it would be great if > you could post a link here. https://github.com/brmson/blanqa is not developed anymore, but uses Maven + uimaFIT. Some other OpenQA/OQA components prolly do too. https://github.com/brmson/yodaqa uses gradle + uimaFIT (from the maven repo), not sure if that qualifies. :-) -- Petr Baudis If you have good ideas, good data and fast computers, you can do almost anything. -- Geoffrey Hinton
Re: Basic UIMA questions
Hi!

On Thu, Jan 14, 2016 at 02:09:07PM +, Sean Crist wrote:
> I have a few questions on the basic concepts of UIMA. It’s fine if you tell
> me to read the manuals, but I haven’t been able to find the answers there so
> far, so a chapter reference would be a big help.
>
> 1) If Annotator A creates an annotation, is it OK for Annotator B to
> modify the information in the annotations which A created?

Yes, that's fine. (I hope - maybe the rules change a little in a distributed environment, and for some reason I always reindex the annotations, but that might not be necessary anymore - I'll let someone else fill in the details here.)

> 2) I’ve read that an annotation can contain a reference to another
> annotation, but I haven’t been able to find instructions or an example.
>
> Possibly, I could generate the annotation class using JCasGen, and then
> manually augment the auto-generated code to support references to other
> annotation objects. Is that a good way to do it? Or is there some kind of
> built-in support?

Sure, the feature type does not need to be a primitive UIMA type like uima.cas.Integer; it can also be a reference to another feature structure type like uima.tcas.Annotation (reference to an unspecified type of annotation) or de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token (reference to a particular type of annotation). JCas then handles all the resolution for you and the get...() function will return an instance of the correct JCas class of the referenced annotation.

> 3) Suppose I want a parser to build a parse tree over tokens. A parse tree
> consists of a hierarchy of nodes.
>
> I could represent each node as an annotation. Is that the most UIMA-like
> solution?
>
> The reason I hesitate is this. If I were writing a non-UIMA solution from
> scratch, I’d treat all of the nodes above the token level as abstract units,
> and those abstract units wouldn’t deal in concrete information such as the
> beginning and end of a character range. I’d keep track of that only at the
> token level. I think that all UIMA annotations are required to keep track of
> this information.
>
> Also, it sounds the only way for an annotator to retrieve existing
> annotations is to create an iterator and pull them out one by one. I wish
> there were a way to just get a reference to the root node of my parse tree,
> so that I can simply step recursively through the tree (which assumes I’ve
> arranged for each node to contain references to its children).

Yes, you would represent each node as an annotation - or rather, each edge as an annotation (typically annotating the "receiving end"). That's exactly how e.g. DKPro does it when wrapping StanfordParser. It's not really painful to work with the tree this way, see e.g.

https://github.com/brmson/yodaqa/blob/master/src/main/java/cz/brmlab/yodaqa/analysis/question/FocusGenerator.java

for an example of code that applies a simple set of blackboard rules to a parse tree to find a focus of a question sentence.

--
Petr Baudis
If you have good ideas, good data and fast computers, you can do almost anything. -- Geoffrey Hinton
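The parse-tree discussion above can be illustrated with a small plain-Java sketch. It is not UIMA code: Node is a hypothetical stand-in for a JCas annotation class whose feature holds references to child annotations, and the labels and offsets are made up. It only shows that when every inner node derives its begin/end from its children, keeping offsets on each node costs little, and that holding a reference to the root is enough for recursive traversal.

```java
import java.util.ArrayList;
import java.util.List;

public class ParseTreeSketch {
    /** Hypothetical stand-in for a parse-tree annotation; not a UIMA class. */
    public static final class Node {
        public final String label;
        public final int begin;
        public final int end;
        public final List<Node> children = new ArrayList<>();

        /** Leaf (token) node with explicit character offsets. */
        public Node(String label, int begin, int end) {
            this.label = label;
            this.begin = begin;
            this.end = end;
        }

        /** Inner node: its span is derived from the first and last child. */
        public Node(String label, List<Node> kids) {
            this.label = label;
            this.children.addAll(kids);
            this.begin = kids.get(0).begin;
            this.end = kids.get(kids.size() - 1).end;
        }
    }

    /** Recursively render the tree in bracketed form, starting from any node. */
    public static String render(Node n, String text) {
        if (n.children.isEmpty()) {
            return "(" + n.label + " " + text.substring(n.begin, n.end) + ")";
        }
        StringBuilder sb = new StringBuilder("(" + n.label);
        for (Node c : n.children) {
            sb.append(' ').append(render(c, text));
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        String text = "Dogs bark";
        Node dogs = new Node("NNS", 0, 4);
        Node bark = new Node("VBP", 5, 9);
        Node root = new Node("S", List.of(
                new Node("NP", List.of(dogs)),
                new Node("VP", List.of(bark))));
        System.out.println(render(root, text));
    }
}
```

In an actual UIMA type system the children list would presumably be an FSArray or FSList feature, and the root would be found once via the annotation index and then followed by reference.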
Re: [UK OFFICIAL] Baleen - UIMA Based Text Analytics Framework
Hi! On Mon, Sep 28, 2015 at 02:31:03PM +0100, Baker James D wrote: > I would like to draw your attention to a text analytics framework that has > just been released by Dstl (part of the UK Ministry of Defence). It uses UIMA > as part of its underlying architecture but provides additional functionality > on top of that, and simplifies much of the user configuration and experience, > as well as the development process. A number of collection readers, > annotators and consumers are included as part of the framework. > > The tool is called Baleen, and is released under Apache Software License 2. > > There is more information about the tool on the press release > (https://www.gov.uk/government/news/dstl-adds-to-open-source-software), and > on the GitHub page (https://github.com/dstl/baleen). Thanks for the heads up. However, I haven't found any clear summary of what is the framework capable of right now - I think you might want to expand the generic description a bit with some examples and use-cases. I have been looking around a bit and seems like e.g. https://github.com/dstl/baleen/blob/master/baleen/baleen-annotators/src/main/java/uk/gov/dstl/baleen/annotators/cleaners/MergeAdjacentQuantities.java is something that could be pretty useful, but you might want to make it easier to discover the capabilities to get more users / contributors. Best, Petr Baudis
Re: CAS merger/multiplier N:M mapping
Hi! On Sun, Sep 06, 2015 at 10:58:44AM -0400, Eddie Epstein wrote: > On Sun, Sep 6, 2015 at 10:11 AM, Petr Baudis <pa...@ucw.cz> wrote: > > (ii) Use an internal "intermediary" CAS instance in process() to which > > I append my sentences, then use it as a source of output CASes. Turns > > out (surprisingly) that I can't append to a sofa documenttext ("Data for > > Sofa feature setLocalSofaData() has already been set." - not sure about > > the reason for this restriction). > > > > The Sofa data for a view is immutable, otherwise existing annotations > could become invalid. But in my case, I'd only append to the end, so this concern is moot. It's rather easy anyway to make your annotations go invalid if you use CasCopier a bit. > > I think the only choice except downright unmaintainable hacks (like > > programatically generated M views) is to just give up on preserving my > > annotations and carry over just the sentence texts. Am I missing > > something? > > > > Creating a new view in the intermediate CAS for each of the N input CASes > would work. A new output CAS Sofa would be comprised of data from > multiple views and of course the annotation end points adjusted as when > added to the new output CAS. I guess that .getViewIterator() would make this not so frustrating, so I'll try this route, thanks for the tip! > One problem there is that the intermediate CAS would continue to grow > in size, so there would need to be some point when it could be reset. Indeed, well, when you output all M CASes is a good point. I assume .release() would accomplish this. 
> > (I'm somewhat tempted to cut my losses short (much too late) and > > abandon UIMA flow control altogether, using only simple pipelines and > > having custom glue code to connect these together, as it seems like > > getting the flow to work in interesting cases is a huge time sink and in > > retrospect, it could never pay off any abstract advantage of easier > > distributed processing (where you probably end up having to chop up the > > pipeline manually anyway). I would probably never recommend new UIMA > > users to strive for a single pipeline with CAS multipliers/mergers and > > begin to consider these features an evolutionary dead end rather than > > advantageous. Not sure if there even *are* any other real users using > > advanced flows besides me and DeepQA. I'll be glad to hear any opinions > > on this!) > > > > > Definitely the advantage to encapsulating analytics in standard UIMA > components is easy scalability via the vertical and horizontal scale out > options offered by UIMA-AS and DUCC. Flexibility in chopping up a > pipeline into services as needed is another advantage. But as far as I understand, you need to explicitly define and deploy AEs that are to be run on different machines anyway. So I'm not sure if the extra value is really that large in the end? > The previously mentioned GALE multimodal application also converted > sequences of N input CASes to M output CASes. In that case the input > CASes represented 2 minutes worth of speech-to-text transcription of > broadcast news, and each output CAS represented a single news story. > The story-CASes then went thru a pipeline that identified the story and > updated a pre-existing summarization for each story. Interesting (and good to hear), thanks! -- Petr Baudis If you have good ideas, good data and fast computers, you can do almost anything. -- Geoffrey Hinton
Re: UIMAj3 ideas
On Fri, Jul 10, 2015 at 01:37:27PM -0400, Marshall Schor wrote: On 7/9/2015 6:52 PM, Petr Baudis wrote: snip... https://cwiki.apache.org/confluence/display/UIMA/Ideas+for+UIMAJ+v3 I didn't figure out how to edit that wiki page, Due to spammers, we had to turn off public editing. However, I can add you to a list ( to do this, you have to register for a user id on the wiki, and then send me offline what that Id is ), but even without being on the list, there's a comment button which (I think) lets you add comments at the bottom. but a mental summary of the things I find currently irritating about UIMA and would love to see changed formed in my mind, so I thought I could contribute it for discussion. Great! * UIMAfit is not part of core UIMA and UIMA-AS is not part of core UIMA. It seems to me that UIMA-AS is doing things a bit differently than what the original UIMA idea of doing scaleout was. The two things don't play well together. I'd love a way to easily take my plain UIMA pipeline and scale it out, ideally without any code changes, *and* avoid the terrible XML config files. Any specifics of what to change here would be helpful. UIMA-AS was designed to enable scale-out without changing the core UIMA pipeline or it's XML descriptor. THe additional information for UIMA-AS scaleout was put into a separate xml descriptor which embeds the original plain UIMA one. I'm sure Richard would be able to explain this better, but I think one of the core issues is that UIMA-AS embeds the XML descriptor instead of the AnalysisEngineDescription. 
So when I want to use it together with AnalysisEngineDescription built with UIMAfit instead, it's time to start making crazy workarounds like https://code.google.com/p/dkpro-lab/source/browse/de.tudarmstadt.ukp.dkpro.lab/de.tudarmstadt.ukp.dkpro.lab.uima.engine.uimaas/src/main/java/de/tudarmstadt/ukp/dkpro/lab/uima/engine/uimaas/component/SimpleService.java?name=14aeba50c8c1r=14aeba50c8c18ea4d14c0d099f43c049f806d9db * Connected with the above - I'd love .addToIndexes() to just disappear. Right now, the paradigm is that you build an annotation in an annotator, and the moment it gets saved in a CAS, it becomes basically read-only. You certainly can modify any of an Annotation's features subsequently. I'm guessing you're referring to another idea - adding additional features that were not initially defined in the UIMA type system. Sorry for the confusion, but that's not quite what I had in mind. I literally believe that right now, in order to modify value of a feature, you need to first remove it from an index, change the value, then re-add it back. Is that a misconception? UIMA sets up the types and features once at the start of the pipeline run (from a merge of all the component's type systems), and locks down the type system. Other frameworks sometimes allow an unlocked type system, where you could add (after a Feature Structure is created) additional features. This is usually done by keeping a list of feature-name - feature-value pairs (such as your code snippet does, below). We're thinking of including this capability in the version 3, with a bit of a twist - the intent would be to keep the compilable aspect of locked-down type/features (for high performance), while adding (for those use cases that want it) the other style of dynamically added additional features (at some cost in performance). Still, this would be awesome and I'd totally make use of it! 
(The code in my original email I guess conflates demonstration of two issues - the addToIndex and lack of variable-sized lists, i.e. the java collection support issue. Even if you decide generic collection / map support would be too tricky, at least supporting variable-sized lists would help a lot...) * I wondered about storing (arbitrary) graphs in the CAS, but the issues above make this really impractical. If you also think about integrating microformats, you need to think about how to do this. We have had users store arbitrary graphs in the CAS, but, yes, it is not so efficient. The main element UIMA has for collections of references (to FeatureStructures) are the FSArray and FSList. As you point out the FSArray is fixed length. The FSList supports dynamic adding/removing etc. using the standard link-list technology. However, because UIMA data in the CAS (currently) is not garbage collected, you have to be careful when using this technique. ...oh, never mind. After using UIMA heavily for well over a year, I managed not to learn that FSList exists at all! Thanks for this pointer. I think that's a bug for the UIMA Tutorial, which mentions FSArray but not FSList. :-) (Another pain point here - I always ache when I need to work with FSArray or I guess FSList, since it does not carry the type information that is in the typesystem - I
Re: UIMAj3 ideas
Hi!

On Fri, Jul 10, 2015 at 10:28:08AM -0400, Eddie Epstein wrote:
> Good comments which will likely generate lots of responses. For now please see comments on scaleout below.
>
> On Thu, Jul 9, 2015 at 6:52 PM, Petr Baudis <pa...@ucw.cz> wrote:
> > * UIMAfit is not part of core UIMA and UIMA-AS is not part of core UIMA. It seems to me that UIMA-AS is doing things a bit differently than what the original UIMA idea of doing scaleout was. The two things don't play well together. I'd love a way to easily take my plain UIMA pipeline and scale it out, ideally without any code changes, *and* avoid the terrible XML config files.
>
> Not clear what you are referring to as the original UIMA idea of doing scaleout, the CPE? Core UIMA is a single threaded, embeddable framework. UIMA-AS is also an embeddable framework that offers flexible vertical (multi-threading) and horizontal (multi-process) options for deploying an arbitrary pipeline. Admittedly scaleout with UIMA-AS is complicated and the minimal support for process management make it difficult to do scaleout simply.
>
> In what ways do you think UIMA-AS is inconsistent with UIMA or UIMA scaleout?

Well, my impression after delving into some UIMA internals was that the original idea was to use the Analysis Structure Broker to control the pipeline flow, and it would seem natural that when doing scale-out, one would simply provide a different ASB. Its javadoc even reads:

  The Analysis Structure Broker (ASB) is the component responsible for the details of communicating with Analysis Engines that may potentially be distributed across different physical machines.

Of course, maybe I got it wrong.

> DUCC is full cluster management application that will scaleout a plain UIMA pipeline with no code changes, assuming that the application code is threadsafe. But a typical pipeline with a single collection reader creating input CASes and a single cas consumer will limit scaleout performance pretty quickly. DUCC makes it easy to eliminate the input data bottleneck. DUCC sample apps show one approach to eliminating the output bottleneck.
>
> Have you looked at DUCC?

I use a UIMA pipeline for question answering, where each question currently takes ~30s (single-threaded) to process (a lot of it spent waiting on databases), so I don't think I'd hit such a bottleneck.

I did spend a few tens of minutes looking at DUCC, but I got the impression that it's not really trivial to set up. One of my goals is to minimize setup hassles for anyone who wants to run my software - ideally, they should be able to just compile and run. If I started to use DUCC, I'm not sure to what degree I could preserve this, but at least it's another element in the already steep learning curve for anyone who wants to tinker with the system.

(Then there's this whole issue of UIMA-AS vs. UIMAfit and in-memory resource sharing - though from one of your previous emails, I got the impression that I could run multiple AEs in threads of a single java process; but I guess at that point I was already decided that I want to try something less complex.)

--
Petr Baudis
If you have good ideas, good data and fast computers, you can do almost anything. -- Geoffrey Hinton
Re: UIMAj3 ideas
Hi!

On Thu, Jul 16, 2015 at 07:42:58PM +, Thomas Ginter wrote:
> Have you looked into using Leo? It allows you to programmatically create Analysis Engines, Aggregates, the type system, and launch everything in UIMA-AS without having to manage any XML descriptors at all. Furthermore it is available via Maven so your code can compile and run.
>
> http://department-of-veterans-affairs.github.io/Leo/userguide.html

I had a look, but got the impression that I'd have to rewrite most of my pipeline generation code, and it's not small code. Also, it's not clear to me from Leo's docs whether and/or how it supports CAS multipliers and mergers; there seem to be no references to that.

This impression might have been wrong, but overall I'd just welcome it if I could stick with stock UIMA for scaleout, at least in the form of multi-threading without cluster scaleout (which I think many UIMA users would welcome, while a much smaller percentage wants to deploy to a cluster). That's what I was trying to say originally.

--
Petr Baudis
If you have good ideas, good data and fast computers, you can do almost anything. -- Geoffrey Hinton
Re: UIMAj3 ideas
On Thu, Jul 16, 2015 at 08:00:35PM +0200, Richard Eckart de Castilho wrote:
> On 16.07.2015, at 18:52, Petr Baudis <pa...@ucw.cz> wrote:
> > Sorry for the confusion, but that's not quite what I had in mind. I literally believe that right now, in order to modify value of a feature, you need to first remove it from an index, change the value, then re-add it back. Is that a misconception?
>
> Well, yes and no. Yes, it was required for the case where the value that you changed was on a feature that was part of some index. No, it should no longer be required as measures have been implemented to handle this automatically.
>
> See: The curious case of the zombie annotation aka UIMA-4049
> https://issues.apache.org/jira/browse/UIMA-4049

That's great to hear! However, when reading the bug report and looking closely at that part of the release notes, I think "it should no longer be required" isn't quite precise, as changing indexed features might cause an exception to be thrown by an iterator that goes through these at the same time (so the fix for that is to use a snapshot iterator, and that sounds reasonable, more so when JCasUtil gets support for them - sorry if it did and I missed it, I'm still stuck on UIMA 2.6 for now anyway until the next release with fixed CasCopier).

> > I think that's a bug for the UIMA Tutorial, which mentions FSArray but not FSList. :-)
>
> Then I should tell you also about the uimaFIT FSCollectionFactory which contains all kinds of helpers to manage FSArray and FSList ;) Btw. there is also ArrayFS which is the CAS version of FSArray :P ..
>
> Did you know that uimaFIT JCasUtil.select() can also be applied to FSList and FSArray to avoid casting?
>
>   for (Token t : JCasUtil.select(sentence.getTokens(), Token.class)) { ... }
>
> CasUtil.select() can work also on ArrayFS

So much great news! Thanks so much for these. We'll certainly start using them in new code. :-)

--
Petr Baudis
If you have good ideas, good data and fast computers, you can do almost anything. -- Geoffrey Hinton
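The remove / modify / re-add pattern and the snapshot-iteration point discussed above can be modelled with a plain java.util.TreeSet standing in for a sorted annotation index. This is only an analogy, not UIMA API: Ann, setBegin and shiftAll are hypothetical names, and how UIMA itself protects its indexes is described in UIMA-4049 rather than here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.TreeSet;

public class IndexUpdateSketch {
    /** Hypothetical annotation stand-in: "begin" is the sort key of the index. */
    public static final class Ann {
        public final int id;
        public int begin;

        public Ann(int id, int begin) {
            this.id = id;
            this.begin = begin;
        }
    }

    /** A sorted "index" ordered by begin offset, with id as a tie-breaker. */
    public static TreeSet<Ann> newIndex() {
        return new TreeSet<>(
                Comparator.<Ann>comparingInt(a -> a.begin).thenComparingInt(a -> a.id));
    }

    /** Change a key feature safely: remove from the index, mutate, re-add. */
    public static void setBegin(TreeSet<Ann> index, Ann a, int newBegin) {
        boolean wasIndexed = index.remove(a); // must happen BEFORE the key changes
        a.begin = newBegin;
        if (wasIndexed) {
            index.add(a);
        }
    }

    /** Update every entry while iterating over a snapshot, not the live index. */
    public static void shiftAll(TreeSet<Ann> index, int delta) {
        for (Ann a : new ArrayList<>(index)) { // snapshot copy of the index contents
            setBegin(index, a, a.begin + delta);
        }
    }

    public static void main(String[] args) {
        TreeSet<Ann> index = newIndex();
        Ann a = new Ann(1, 5);
        index.add(a);
        index.add(new Ann(2, 1));
        setBegin(index, a, 0);
        shiftAll(index, 10);
        System.out.println(index.first().id + " begins at " + index.first().begin);
    }
}
```

Iterating over the live TreeSet inside shiftAll would modify the set during iteration, which is the kind of situation a snapshot iterator is meant to avoid.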
Re: [ANN] Multi-threaded UIMA ASB
On Thu, Jul 09, 2015 at 04:17:44PM -0400, Marshall Schor wrote: Hi, just saw this ... I'll take a look. This kind of thing is on the list for uima v3; see https://cwiki.apache.org/confluence/display/UIMA/Ideas+for+UIMAJ+v3 Thanks, I was not aware of that page. However, it seems to concern a much harder case of annotators working in parallel on the same CAS. I'm solving an easy case where each CAS is processed by just a single annotator at once. For this, there are thankfully no large changes in current UIMA needed, apparently, if one accepts a few rough corners (as documented). -- Petr Baudis If you have good ideas, good data and fast computers, you can do almost anything. -- Geoffrey Hinton
Re: Multi-threaded UIMA ParallelStep
Hi!

On Wed, May 20, 2015 at 07:56:33AM -0400, Eddie Epstein wrote:
> Parallel-step currently only works with remote delegates. The other approach, using CasMultipliers, allows an arbitrary amount of parallel processing in-process. A CM would create a separate CAS for each delegate intended to run in parallel, and use a feature structure to hold a unique identifier in each child CAS which a custom flow controller would use to direct these CASes to the desired delegates. Results for the parallel flows could be merged in a CasConsumer back into the parent CAS or to some other output.

Thanks for that hint. However, I'm not sure how a flow controller could direct CASes to delegates? As far as I understand it, the flow controller decides which AE processes the CAS next, but cannot control the actual parallel execution of the flow, which would need to be taken care of by the ASB (Analysis Structure Broker), and that would be the thing to hack in this case. Am I missing something?

Thanks,

Petr Baudis
Multi-threaded UIMA ParallelStep
Hi!

I'm looking into ways to run a part of my pipeline multi-threaded:

                .-> Multip0 -> A1 -> Multip1 -> A2 ->.
  reader -> A0 <                                      > CASmerger
                `-> Multip2 -> A3            -> A2 ->'
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                ParallelStep is generated for each branch
                in a custom flow controller

Basically, I need a way to tell UIMA to run each ParallelStep (which normally just denotes the CAS flow) truly in parallel. I have two constraints:

(i) I'm using UIMAfit heavily, and multiple CAS multipliers and mergers (even within the parallel branches). So I can't use CPE.

(ii) I need multi-threading, not separate processes. (I have just a meager 24G RAM (sigh) and one Java process with all the linguistic models and stuff loaded takes 3GB RAM. So I really need to load these resources to memory only once.)

I looked into UIMA-AS, including Richard's helpful DKPro-lab code sample, but I can't figure out how to make it reasonably work with a *complex* UIMAfit pipeline that spans many branches and many analysis engines - it seems to me that I would need some centralized places where to specify it, and basically completely rewrite my pipeline building code (to the worse, in my impression).

...and I'm not even sure, from reading UIMA-AS code, if I could make it run in multiple threads within a single process! From comments in

  org/apache/uima/aae/controller/AggregateAnalysisEngineController_impl.java:parallelStep()

I'm getting an impression that non-remote AEs will be executed serially after all, not in parallel. Is that correct?

So going back to the original UIMA code, it seems to me that the thing to do would be replacing ASB_impl with my own copy (inheritance would not cut it the way it's coded), AggregateAnalysisEngine_impl with my own specialization or copy (as ASB_impl usage is hardcoded there) and rewrite the while() loop in ParallelStep case of ASB's processUntilNextOutputCas() to run in parallel. And hope I didn't miss any catch...

Is there an option I'm missing? Any hints would be really appreciated!

Thanks,

Petr Baudis
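The goal stated above (run each branch in its own thread inside one JVM, load the heavy resources only once, then merge the outputs) can be sketched in plain Java, independent of UIMA. Everything here is hypothetical: a branch is modelled as a function from one input string to several output strings (standing in for "one CAS in, many CASes out"), and the shared stopword set stands in for the linguistic models. It says nothing about how ASB_impl would actually have to be changed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelBranchesSketch {
    /**
     * Feed the same input to every branch on its own thread and return the
     * concatenated outputs in branch order (the "merger" step).
     */
    public static List<String> runBranches(
            String input, List<Function<String, List<String>>> branches) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, branches.size()));
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (Function<String, List<String>> branch : branches) {
                Callable<List<String>> task = () -> branch.apply(input);
                pending.add(pool.submit(task));
            }
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> f : pending) {
                merged.addAll(f.get()); // blocks until that branch has finished
            }
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("a branch failed", e);
        } finally {
            pool.shutdown();
        }
    }

    /** Two toy branches sharing one read-only resource that is created once. */
    public static List<String> demo() {
        Set<String> stopwords = Set.of("is"); // stands in for a large shared model
        Function<String, List<String>> contentWords = s -> {
            List<String> out = new ArrayList<>();
            for (String w : s.split(" ")) {
                if (!stopwords.contains(w)) {
                    out.add(w);
                }
            }
            return out;
        };
        Function<String, List<String>> shout = s -> List.of(s.toUpperCase());
        return runBranches("what is uima", List.of(contentWords, shout));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The sketch only works because the shared resource is immutable and the branches do not touch each other's data; annotators that are not thread-safe would need more care.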
UIMAFit vs. LEO
Hi!

On Thu, May 14, 2015 at 05:44:12PM +, Thomas Ginter wrote:
> There is also Leo which allows you to programmatically create pipelines, launch them as UIMA-AS services, and manage types systems and clients without having to touch any descriptor files. You can find documentation at the site below:
>
> http://department-of-veterans-affairs.github.io/Leo/userguide.html

I'm wondering how UIMAFit and LEO fit together. My impression right now is:

  * They both have the same goal.
  * Mixing them in the same pipeline might get messy(?)
  * LEO's advantage is that it seamlessly works with UIMA-AS (in fact it's built around UIMA-AS).
  * UIMAFit's advantage is (if nothing else) a vastly wider ecosystem.

Did I get this about right?

Thanks,

Petr Baudis
Re: looking for lots of example UIMA code
On Wed, Oct 29, 2014 at 12:43:41PM -0400, Kameron Cole wrote:
> Thanks for the references. As for the samples on the UIMA site, this is quite a find. I have been on this site for 10 years now, and never really stumbled across it. Just to be sure, this is where I am finding the most useful examples:
>
> http://svn.apache.org/viewvc/uima/uimaj/tags/uimaj-2.6.0/uimaj-examples/src/main/java/org/apache/uima/examples/
>
> Am I missing anything?

Also make sure you look at uimaFIT. It makes building pipelines so much easier, and also has some examples.

> Thanks also to Sergey. My main interest these days is inter-leaving UIMA code in the custom stages of IBM Watson Content Analytics. That leaves the arduous work of making annotations to the wizard style development environment of WCA Studio, and the UIMA portion I use for call outs to other programs.

If you are looking for examples of UIMA pipeline code rather than UIMA annotator code, https://github.com/brmson/yodaqa has a moderately interesting branched CAS pipeline. I had quite a lot of trouble finding other open source code examples that implement a non-linear pipeline.

Petr Baudis
Re: Restricting a aggregate engine to a substring or mention
On Tue, Jun 17, 2014 at 06:48:15PM +, Oliver Christ wrote: dkpro-core's BreakIteratorSegmenter (rather: its base class) takes the same approach. It allows you to specify that segmentation should occur within zones, defined by some other annotation type. And for most other dkpro-core's annotators adding other linguistic features, it is thankfully typically fine to just prune the Sentence annotations to the areas you want annotated. That's the approach I'm using when I first pre-filter a document for interesting sentences, then copy just these over to another view and run the taggers and parsers on just these. Petr Pasky Baudis
Parallel Flow Controller?
Hi!

In my UIMA pipeline, at a few points I have a need for some AEs to be executed logically in parallel - in particular, I'd need this in case of a few CAS multipliers. If I understand things correctly, there is no way with the fixed flow controller to execute two CAS multipliers in parallel, i.e. both using a single source CAS, dropping it and producing a bunch of new CASes.

I need to create a CAS processing graph like:

                .-> Multip0 -> A1 -> Multip1 -> A2 ->.
  reader -> A0 <                                      > CASmerger
                `-> Multip2 -> A3            -> A2 ->'

My current aim would be enclosing each of the branches (up to A2) in an aggregate AE, and creating another aggregate AE that will consist of these two branch AEs, governed by a custom parallel flow controller that will ensure the input CAS is fed as input to both branches and the union of output CASes of both branches is sent out of the aggregate AE:

  Main:    reader -> A0 -> AggregP -> A2 -> CASmerger
  AggregP: Aggreg0, Aggreg1 (ParallelFlowController)
  Aggreg0: Multip0 -> A1 -> Multip1
  Aggreg1: Multip2 -> A3

I'd just like to confirm that no one has implemented such a parallel flow controller yet and that I'm not missing a simple existing solution to this problem.

Kind regards,

Petr Pasky Baudis
Copying a CAS subset with offset correction
Hi!

I'm trying to figure out how to reliably do deep copies from one CAS to another where the sofa of the target CAS is a subset of the source CAS. E.g. copying just one sentence out of a larger document.

One approach is to simply do something like

	int ofs = subCasSpan.getBegin();
	CasCopier copier = new CasCopier(srcCas.getCas(), dstCas.getCas());
	for (Annotation a : JCasUtil.selectCovered(Annotation.class, subCasSpan)) {
		Annotation a2 = (Annotation) copier.copyFs(a);
		a2.setBegin(a2.getBegin() - ofs);
		a2.setEnd(a2.getEnd() - ofs);
		a2.addToIndexes();
	}

However, the problem is when the feature structure contains references to other feature structures; if these are outside the span, their offsets will not get modified and these hidden feature structures will remain referenced but become nonsensical and misleading, instead of (ideally) not being copied and being replaced by null references. I don't think this is something that's easily achievable right now? (The possible annotation types are an open set; manual per-annotation handling of references is not feasible in my case.)

I think the most reasonable solution would be to introduce a way to specify an offset span for the CasCopier (or a subclass), with annotations dropped if they are outside of the offset span?

Thanks,

Petr Pasky Baudis
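The behaviour proposed above (copy only what lies inside the span, shift offsets, and turn references to anything outside the span into null) can be sketched on a plain-Java model. This is not CasCopier and not UIMA API: Ann is a hypothetical annotation with a single reference feature, and copyWithin is an assumed helper name. A real solution would also have to deal with arbitrary feature types and with referenced structures that are not annotations.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class SpanCopySketch {
    /** Hypothetical annotation with offsets and one reference feature. */
    public static final class Ann {
        public final String type;
        public int begin;
        public int end;
        public Ann ref; // reference to another annotation, may be null

        public Ann(String type, int begin, int end, Ann ref) {
            this.type = type;
            this.begin = begin;
            this.end = end;
            this.ref = ref;
        }
    }

    /**
     * Copy the annotations lying fully inside [spanBegin, spanEnd), shifting
     * their offsets so the span starts at 0. References to annotations that
     * were not copied are replaced by null.
     */
    public static List<Ann> copyWithin(List<Ann> src, int spanBegin, int spanEnd) {
        Map<Ann, Ann> copies = new IdentityHashMap<>();
        List<Ann> out = new ArrayList<>();
        // Pass 1: copy in-span annotations with corrected offsets.
        for (Ann a : src) {
            if (a.begin >= spanBegin && a.end <= spanEnd) {
                Ann c = new Ann(a.type, a.begin - spanBegin, a.end - spanBegin, null);
                copies.put(a, c);
                out.add(c);
            }
        }
        // Pass 2: rewire references; targets outside the span map to null.
        for (Map.Entry<Ann, Ann> e : copies.entrySet()) {
            e.getValue().ref = copies.get(e.getKey().ref);
        }
        return out;
    }

    public static void main(String[] args) {
        // Text: "Alpha beta. Gamma delta." with the second sentence at 12..24.
        Ann alpha = new Ann("Token", 0, 5, null);
        Ann gamma = new Ann("Token", 12, 17, null);
        Ann delta = new Ann("Token", 18, 23, gamma);     // points inside the span
        Ann coref = new Ann("Coreference", 18, 23, alpha); // points outside the span
        for (Ann c : copyWithin(List.of(alpha, gamma, delta, coref), 12, 24)) {
            System.out.println(c.type + " " + c.begin + ".." + c.end
                    + " ref=" + (c.ref == null ? "null" : c.ref.type));
        }
    }
}
```

The two-pass structure is the design point: references can only be rewired once it is known which annotations made it into the copy.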
Re: Deduplicating Annotations With Same coveredText
Hi!

On Tue, Apr 22, 2014 at 05:10:56PM -0400, Marshall Schor wrote:
> If you plan on running your pipeline in one JVM (rather than having it scaled out over multiple JVMs), you can consider using an external resource which would be a plain Java Set<String> of the unique covered text so far found.
>
> Then, in the annotator (or annotators) that are adding new FeatureStructures representing the possibly duplicate annotation, you can first check the shared resource to see if it's been already annotated, and if so, skip both creating the additional FeatureStructure, and adding it to the indexes.
>
> Would that work for your use case?

That's an interesting approach, thanks for the suggestion. While I could do it this way now, I plan to scale out my setup to multiple machines in the future and this solution would become inconvenient then.

For the time being, I have simply loaded all the FSes to a coveredText-addressed map and then removed duplicates.

Petr Pasky Baudis
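The "coveredText-addressed map" approach mentioned above could look roughly like this in plain Java. Span and dedupe are hypothetical names and this is not the actual code from the pipeline; in a UIMA annotator the spans would be annotations, the key would come from getCoveredText(), and the discarded ones would presumably be removed from the indexes afterwards.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CoveredTextDedupSketch {
    /** Hypothetical annotation stand-in: just a character range. */
    public static final class Span {
        public final int begin;
        public final int end;

        public Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Keep the first span for each distinct covered text, in document order. */
    public static List<Span> dedupe(String text, List<Span> spans) {
        Map<String, Span> byCoveredText = new LinkedHashMap<>();
        for (Span s : spans) {
            byCoveredText.putIfAbsent(text.substring(s.begin, s.end), s);
        }
        return new ArrayList<>(byCoveredText.values());
    }

    public static void main(String[] args) {
        String text = "UIMA and uimaFIT and UIMA";
        List<Span> spans = List.of(
                new Span(0, 4), new Span(5, 8), new Span(9, 16),
                new Span(17, 20), new Span(21, 25));
        for (Span s : dedupe(text, spans)) {
            System.out.println(s.begin + ".." + s.end + " " + text.substring(s.begin, s.end));
        }
    }
}
```

Unlike the shared Set<String> resource suggested in the quoted message, this runs after the fact within one CAS, so it needs no state shared between annotators.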
Re: CAS Multiplier usage in UIMAfit
Hi!

On Wed, Apr 16, 2014 at 03:26:54PM +0900, Hugo Mougard wrote:
> I'm trying to use a multiplier to discard some CASes based on some annotation. It currently doesn't work (the CASes are not discarded). I also noticed several tickets opened on the subject of multipliers and am therefore not sure if it's currently possible to use them in UIMAfit.

Perhaps a better solution exists meanwhile, but some time ago, Philipp W suggested on this mailing list a SimplePipeline replacement that can deal with CAS multipliers:

https://groups.google.com/forum/#!topic/uimafit-users/yA0w2Q8tGNE

I had to wrap it up in an actual class and fix it for Aggregate Engines; my version is at:

https://github.com/brmson/yodaqa/blob/master/src/main/java/cz/brmlab/yodaqa/flow/MultiCASPipeline.java

You just use it in the same way as you'd use SimplePipeline then, e.g.:

https://github.com/brmson/yodaqa/blob/9e12a80c/src/main/java/cz/brmlab/yodaqa/YodaQAApp.java

P.S.: I think ideally, to enable better scale-out and for consistency if you are using other Aggregate Engines anyway, you would probably create a single aggregate engine for your pipeline with the proper flow controller setup within, setting FlowController's ActionForIntermediateSegments to drop. In XML CPE descriptor you'd do that like this:

https://github.com/brmson/yodaqa/blob/bad64d5c/src/main/resources/cz/brmlab/yodaqa/pipeline/YodaQA.xml

If you come up with a way to do that in UIMAfit, I will be glad if you'd share a working code snippet.

Petr Pasky Baudis