Re: Who is using the Maven uimaFIT plugin in open source?

2016-02-04 Thread Petr Baudis
  Hi!

On Thu, Feb 04, 2016 at 10:11:00AM +0100, Richard Eckart de Castilho wrote:
> I am looking for open source projects or at least publicly
> distributed components that are using UIMA in conjunction
> with Maven and with the uimaFIT Maven plugin.
> 
> If you know or have such a project, it would be great if
> you could post a link here.

  https://github.com/brmson/blanqa is not developed anymore, but uses
Maven + uimaFIT.  Some other OpenQA/OQA components prolly do too.

  https://github.com/brmson/yodaqa uses gradle + uimaFIT (from the maven
repo), not sure if that qualifies. :-)

-- 
        Petr Baudis
If you have good ideas, good data and fast computers,
you can do almost anything. -- Geoffrey Hinton


Re: Basic UIMA questions

2016-01-14 Thread Petr Baudis
  Hi!

On Thu, Jan 14, 2016 at 02:09:07PM +, Sean Crist wrote:
> I have a few questions on the basic concepts of UIMA.  It’s fine if you tell 
> me to read the manuals, but I haven’t been able to find the answers there so 
> far, so a chapter reference would be a big help.
> 
> 
> 
> 1)If Annotator A creates an annotation, is it OK for Annotator B to 
> modify the information in the annotations which A created?

  Yes, that's fine.  (I hope - maybe the rules change a little in
a distributed environment, and for some reason I always reindex the
annotations, but that might not be necessary anymore - I'll let someone
else fill in the details here.)

> 2)   I’ve read that an annotation can contain a reference to another 
> annotation, but I haven’t been able to find instructions or an example.
> 
> Possibly, I could generate the annotation class using JCasGen, and then 
> manually augment the auto-generated code to support references to other 
> annotation objects.  Is that a good way to do it?  Or is there some kind of 
> built-in support?

  Sure, the feature type does not need to be a primitive UIMA type like

	uima.cas.Integer

but can also be a reference to another feature structure type like

	uima.tcas.Annotation

(reference to an unspecified type of annotation) or

	de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token

(reference to a particular type of annotation).  JCas then handles all
the resolution for you and the get...() function will return an instance
of the correct JCas class of the referenced annotation.

> 3)   Suppose I want a parser to build a parse tree over tokens.  A parse tree 
> consists of a hierarchy of nodes.
> 
> I could represent each node as an annotation.  Is that the most UIMA-like 
> solution?
> 
> The reason I hesitate is this.  If I were writing a non-UIMA solution from 
> scratch, I’d treat all of the nodes above the token level as abstract units, 
> and those abstract units wouldn’t deal in concrete information such as the 
> beginning and end of a character range.  I’d keep track of that only at the 
> token level.  I think that all UIMA annotations are required to keep track of 
> this information.
> 
> Also, it sounds the only way for an annotator to retrieve existing 
> annotations is to create an iterator and pull them out one by one.  I wish 
> there were a way to just get a reference to the root node of my parse tree, 
> so that I can simply step recursively through the tree (which assumes I’ve 
> arranged for each node to contain references to its children).

  Yes, you would represent each node as an annotation - or rather, each
edge as an annotation (typically annotating the "receiving end").
That's exactly how e.g. DKPro does it when wrapping StanfordParser.

  It's not really painful to work with the tree this way, see e.g.


https://github.com/brmson/yodaqa/blob/master/src/main/java/cz/brmlab/yodaqa/analysis/question/FocusGenerator.java

for an example of code that applies a simple set of blackboard rules
to a parse tree to find a focus of a question sentence.
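
  To illustrate the point about node offsets and recursive traversal,
here is a minimal plain-Java sketch.  It does not use any UIMA classes;
ParseTreeSketch and Node are hypothetical names standing in for
constituent annotations that reference their children.  Inner nodes
derive their begin/end from their children, so only the token level
needs explicit offsets, and once you hold the root you can simply recurse.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ParseTreeSketch {
    /** Hypothetical stand-in for a constituent annotation; not a UIMA class. */
    static final class Node {
        final String label;
        final int begin;   // offset of the first covered character
        final int end;     // offset just past the last covered character
        final List<Node> children = new ArrayList<>();

        /** Token-level node: offsets are given explicitly. */
        Node(String label, int begin, int end) {
            this.label = label;
            this.begin = begin;
            this.end = end;
        }

        /** Inner node: offsets are derived from the first and last child. */
        Node(String label, List<Node> kids) {
            this.label = label;
            this.begin = kids.get(0).begin;
            this.end = kids.get(kids.size() - 1).end;
            this.children.addAll(kids);
        }

        /** Recursively renders the subtree in bracketed form. */
        String render(String text) {
            if (children.isEmpty()) {
                return "(" + label + " " + text.substring(begin, end) + ")";
            }
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node child : children) {
                sb.append(' ').append(child.render(text));
            }
            return sb.append(')').toString();
        }
    }

    /** Builds a tiny tree over the text "dogs bark". */
    static Node build() {
        Node np = new Node("NP", Arrays.asList(new Node("NNS", 0, 4)));
        Node vp = new Node("VP", Arrays.asList(new Node("VBP", 5, 9)));
        return new Node("S", Arrays.asList(np, vp));
    }

    public static void main(String[] args) {
        Node root = build();
        System.out.println(root.render("dogs bark"));
    }
}
```

In real UIMA code the child references would be features of the
annotation type, and the root would still have to be fetched from an
index once before recursing.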

-- 
Petr Baudis
If you have good ideas, good data and fast computers,
you can do almost anything. -- Geoffrey Hinton


Re: [UK OFFICIAL] Baleen - UIMA Based Text Analytics Framework

2015-09-28 Thread Petr Baudis
  Hi!

On Mon, Sep 28, 2015 at 02:31:03PM +0100, Baker James D wrote:
> I would like to draw your attention to a text analytics framework that has 
> just been released by Dstl (part of the UK Ministry of Defence). It uses UIMA 
> as part of its underlying architecture but provides additional functionality 
> on top of that, and simplifies much of the user configuration and experience, 
> as well as the development process. A number of collection readers, 
> annotators and consumers are included as part of the framework.
> 
> The tool is called Baleen, and is released under Apache Software License 2.
> 
> There is more information about the tool on the press release 
> (https://www.gov.uk/government/news/dstl-adds-to-open-source-software), and 
> on the GitHub page (https://github.com/dstl/baleen).

  Thanks for the heads up.  However, I haven't found any clear summary
of what the framework is capable of right now - I think you might want
to expand the generic description a bit with some examples and
use-cases.  I have been looking around a bit and it seems like e.g.


https://github.com/dstl/baleen/blob/master/baleen/baleen-annotators/src/main/java/uk/gov/dstl/baleen/annotators/cleaners/MergeAdjacentQuantities.java

is something that could be pretty useful, but you might want to make it
easier to discover the capabilities to get more users / contributors.

  Best,

    Petr Baudis


Re: CAS merger/multiplier N:M mapping

2015-09-06 Thread Petr Baudis
  Hi!

On Sun, Sep 06, 2015 at 10:58:44AM -0400, Eddie Epstein wrote:
> On Sun, Sep 6, 2015 at 10:11 AM, Petr Baudis <pa...@ucw.cz> wrote:
> >   (ii) Use an internal "intermediary" CAS instance in process() to which
> > I append my sentences, then use it as a source of output CASes.  Turns
> > out (surprisingly) that I can't append to a sofa documenttext ("Data for
> > Sofa feature setLocalSofaData() has already been set." - not sure about
> > the reason for this restriction).
> >
> 
> The Sofa data for a view is immutable, otherwise existing annotations
> could become invalid.

  But in my case, I'd only append to the end, so this concern is moot.

  It's rather easy anyway to make your annotations go invalid if you use
CasCopier a bit.

> >   I think the only choice except downright unmaintainable hacks (like
> > programatically generated M views) is to just give up on preserving my
> > annotations and carry over just the sentence texts.  Am I missing
> > something?
> >
> 
> Creating a new view in the intermediate CAS for each of the N input CASes
> would work. A new output CAS Sofa would be comprised of data from
> multiple views and of course the annotation end points adjusted as when
> added to the new output CAS.

  I guess that .getViewIterator() would make this not so frustrating,
so I'll try this route, thanks for the tip!
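
  The offset adjustment this implies can be sketched in plain Java.
This is only an illustration under my own assumptions: View and Span
are hypothetical stand-ins (not UIMA classes) for a view's text and
its annotations, and merge() concatenates the texts while shifting
each span by the position where its view landed in the output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MergeViewsSketch {
    /** Hypothetical stand-in for an annotation: a label over [begin, end). */
    static final class Span {
        final String label;
        final int begin;
        final int end;

        Span(String label, int begin, int end) {
            this.label = label;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Hypothetical stand-in for one view: its text and the spans over it. */
    static final class View {
        final String text;
        final List<Span> spans;

        View(String text, List<Span> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    /** Concatenates the views and re-bases every span onto the merged text. */
    static View merge(List<View> views, String separator) {
        StringBuilder text = new StringBuilder();
        List<Span> spans = new ArrayList<>();
        for (View v : views) {
            if (text.length() > 0) {
                text.append(separator);
            }
            int offset = text.length(); // where this view starts in the output
            text.append(v.text);
            for (Span s : v.spans) {
                spans.add(new Span(s.label, s.begin + offset, s.end + offset));
            }
        }
        return new View(text.toString(), spans);
    }

    public static void main(String[] args) {
        View a = new View("Hello world.", Arrays.asList(new Span("Token", 6, 11)));
        View b = new View("Bye now.", Arrays.asList(new Span("Token", 0, 3)));
        View merged = merge(Arrays.asList(a, b), " ");
        for (Span s : merged.spans) {
            System.out.println(s.label + ": " + merged.text.substring(s.begin, s.end));
        }
    }
}
```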

> One problem there is that the intermediate CAS would continue to grow
> in size, so there would need to be some point when it could be reset.

  Indeed, well, when you output all M CASes is a good point.
I assume .release() would accomplish this.

> >   (I'm somewhat tempted to cut my losses short (much too late) and
> > abandon UIMA flow control altogether, using only simple pipelines and
> > having custom glue code to connect these together, as it seems like
> > getting the flow to work in interesting cases is a huge time sink and in
> > retrospect, it could never pay off any abstract advantage of easier
> > distributed processing (where you probably end up having to chop up the
> > pipeline manually anyway).  I would probably never recommend new UIMA
> > users to strive for a single pipeline with CAS multipliers/mergers and
> > begin to consider these features an evolutionary dead end rather than
> > advantageous.  Not sure if there even *are* any other real users using
> > advanced flows besides me and DeepQA.  I'll be glad to hear any opinions
> > on this!)
> >
> >
> Definitely the advantage to encapsulating analytics in standard UIMA
> components is easy scalability via the vertical and horizontal scale out
> options offered by UIMA-AS and DUCC. Flexibility in chopping up a
> pipeline into services as needed is another advantage.

  But as far as I understand, you need to explicitly define and deploy
AEs that are to be run on different machines anyway.  So I'm not sure if
the extra value is really that large in the end?

> The previously mentioned GALE multimodal application also converted
> sequences of N input CASes to M output CASes. In that case the input
> CASes represented 2 minutes worth of speech-to-text transcription of
> broadcast news, and each output CAS represented a single news story.
> The story-CASes then went thru a pipeline that identified the story and
> updated a pre-existing summarization for each story.

  Interesting (and good to hear), thanks!

-- 
Petr Baudis
If you have good ideas, good data and fast computers,
you can do almost anything. -- Geoffrey Hinton


Re: UIMAj3 ideas

2015-07-16 Thread Petr Baudis
On Fri, Jul 10, 2015 at 01:37:27PM -0400, Marshall Schor wrote:
> On 7/9/2015 6:52 PM, Petr Baudis wrote:
> snip...
>
> > > https://cwiki.apache.org/confluence/display/UIMA/Ideas+for+UIMAJ+v3
> >
> > I didn't figure out how to edit that wiki page,
> Due to spammers, we had to turn off public editing.  However, I can add you
> to a list (to do this, you have to register for a user id on the wiki, and
> then send me offline what that Id is), but even without being on the list,
> there's a comment button which (I think) lets you add comments at the bottom.
> > but a mental summary
> > of the things I find currently irritating about UIMA and would love to
> > see changed formed in my mind, so I thought I could contribute it for
> > discussion.
> Great!
>
> > * UIMAfit is not part of core UIMA and UIMA-AS is not part of core
> >   UIMA.  It seems to me that UIMA-AS is doing things a bit differently
> >   than what the original UIMA idea of doing scaleout was.  The two
> >   things don't play well together.  I'd love a way to easily take
> >   my plain UIMA pipeline and scale it out, ideally without any code
> >   changes, *and* avoid the terrible XML config files.
> Any specifics of what to change here would be helpful.  UIMA-AS was designed
> to enable scale-out without changing the core UIMA pipeline or its XML
> descriptor.  The additional information for UIMA-AS scaleout was put into a
> separate xml descriptor which embeds the original plain UIMA one.

  I'm sure Richard would be able to explain this better, but I think one
of the core issues is that UIMA-AS embeds the XML descriptor instead of
the AnalysisEngineDescription.  So when I want to use it together with
AnalysisEngineDescription built with UIMAfit instead, it's time to
start making crazy workarounds like


https://code.google.com/p/dkpro-lab/source/browse/de.tudarmstadt.ukp.dkpro.lab/de.tudarmstadt.ukp.dkpro.lab.uima.engine.uimaas/src/main/java/de/tudarmstadt/ukp/dkpro/lab/uima/engine/uimaas/component/SimpleService.java?name=14aeba50c8c1&r=14aeba50c8c18ea4d14c0d099f43c049f806d9db

> > * Connected with the above - I'd love .addToIndexes() to just
> >   disappear.  Right now, the paradigm is that you build an annotation
> >   in an annotator, and the moment it gets saved in a CAS, it becomes
> >   basically read-only.
> You certainly can modify any of an Annotation's features subsequently.
> I'm guessing you're referring to another idea - adding additional features
> that were not initially defined in the UIMA type system.

  Sorry for the confusion, but that's not quite what I had in mind.
I literally believe that right now, in order to modify value of
a feature, you need to first remove it from an index, change the
value, then re-add it back.  Is that a misconception?

> UIMA sets up the types and
> features once at the start of the pipeline run (from a merge of all the
> component's type systems), and locks down the type system.  Other frameworks
> sometimes allow an unlocked type system, where you could add (after a Feature
> Structure is created) additional features.  This is usually done by keeping a
> list of feature-name - feature-value pairs (such as your code snippet does,
> below).  We're thinking of including this capability in the version 3, with a
> bit of a twist - the intent would be to keep the compilable aspect of
> locked-down type/features (for high performance), while adding (for those
> use cases that want it) the other style of dynamically added additional
> features (at some cost in performance).

  Still, this would be awesome and I'd totally make use of it!

  (The code in my original email I guess conflates demonstration of two
issues - the addToIndex and lack of variable-sized lists, i.e. the java
collection support issue.  Even if you decide generic collection / map
support would be too tricky, at least supporting variable-sized lists
would help a lot...)

> > * I wondered about storing (arbitrary) graphs in the CAS, but the
> >   issues above make this really impractical.  If you also think about
> >   integrating microformats, you need to think about how to do this.
> We have had users store arbitrary graphs in the CAS, but, yes, it is not so
> efficient.  The main element UIMA has for collections of references (to
> FeatureStructures) are the FSArray and FSList.  As you point out the FSArray
> is fixed length.  The FSList supports dynamic adding/removing etc. using the
> standard link-list technology.  However, because UIMA data in the CAS
> (currently) is not garbage collected, you have to be careful when using this
> technique.

  ...oh, never mind.  After using UIMA heavily for well over a year,
I managed not to learn that FSList exists at all!  Thanks for this
pointer.

  I think that's a bug for the UIMA Tutorial, which mentions FSArray but
not FSList.  :-)

  (Another pain point here - I always ache when I need to work with
FSArray or I guess FSList, since it does not carry the type information
that is in the typesystem - I

Re: UIMAj3 ideas

2015-07-16 Thread Petr Baudis
  Hi!

On Fri, Jul 10, 2015 at 10:28:08AM -0400, Eddie Epstein wrote:
> Good comments which will likely generate lots of responses.
> For now please see comments on scaleout below.
>
> On Thu, Jul 9, 2015 at 6:52 PM, Petr Baudis <pa...@ucw.cz> wrote:
>
> > * UIMAfit is not part of core UIMA and UIMA-AS is not part of core
> >   UIMA.  It seems to me that UIMA-AS is doing things a bit differently
> >   than what the original UIMA idea of doing scaleout was.  The two
> >   things don't play well together.  I'd love a way to easily take
> >   my plain UIMA pipeline and scale it out, ideally without any code
> >   changes, *and* avoid the terrible XML config files.
> >
>
> Not clear what you are referring to as the original UIMA idea of doing
> scaleout, the CPE? Core UIMA is a single threaded, embeddable framework.
> UIMA-AS is also an embeddable framework that offers flexible vertical
> (multi-threading) and horizontal (multi-process) options for deploying an
> arbitrary pipeline. Admittedly scaleout with UIMA-AS is complicated and the
> minimal support for process management make it difficult to do scaleout
> simply. In what ways do you think UIMA-AS is inconsistent with UIMA or UIMA
> scaleout?

  Well, my impression after delving into some UIMA internals was that
the original idea was to use the Analysis Structure Broker to control
the pipeline flow and it would seem natural that when doing scale-out,
one would simply provide a different ASB.  Its javadoc even reads

 The Analysis Structure Broker (<code>ASB</code>) is the component
 responsible for the details of communicating with Analysis Engines
 that may potentially be distributed across different physical
 machines.

Of course, maybe I got it wrong.

> DUCC is a full cluster management application that will scaleout a plain UIMA
> pipeline with no code changes, assuming that the application code is
> threadsafe.
> But a typical pipeline with a single collection reader creating input CASes
> and a single cas consumer will limit scaleout performance pretty quickly.
> DUCC makes it easy to eliminate the input data bottleneck. DUCC sample apps
> show one approach to eliminating the output bottleneck. Have you looked at
> DUCC?

  I use UIMA pipeline for question answering, where each question
currently takes ~30s (single-threaded) to process (a lot of it spent
waiting on databases), so I don't think I'd hit such a bottleneck.
I did spend a few tens of minutes looking at DUCC, but I got the
impression that it's not really trivial to set up.

  One of my goals is to minimize setup hassles for anyone who wants to
run my software - ideally, they should be able to just compile and run.
If I started to use DUCC, I'm not sure to what degree I could preserve
this, but at least it's another element in the already steep learning
curve for anyone who wants to tinker with the system.

  (Then there's this whole issue of UIMA-AS vs. UIMAfit and in-memory
resource sharing - though from one of your previous emails, I got the
impression that I could run multiple AEs in threads of a single java
process; but I guess at that point I had already decided that I wanted
to try something less complex.)

-- 
Petr Baudis
If you have good ideas, good data and fast computers,
you can do almost anything. -- Geoffrey Hinton


Re: UIMAj3 ideas

2015-07-16 Thread Petr Baudis
  Hi!

On Thu, Jul 16, 2015 at 07:42:58PM +, Thomas Ginter wrote:
> Have you looked into using Leo?  It allows you to programmatically create
> Analysis Engines, Aggregates, the type system, and launch everything in
> UIMA-AS without having to manage any XML descriptors at all.  Furthermore it
> is available via Maven so your code can compile and run.
>
> http://department-of-veterans-affairs.github.io/Leo/userguide.html

  I had a look, but got the impression that I'd have to rewrite most
of my pipeline generation code, and it's not small code.  Also, it's
not clear to me from Leo's docs whether and/or how it supports CAS
multipliers and mergers, there seem to be no references to that.

  This impression might have been wrong, but overall I'd just welcome
being able to stick with stock UIMA for scaleout, at least in the form
of multi-threading without cluster scaleout (I think many UIMA users
would welcome that, while a much smaller percentage wants to deploy to
a cluster); that's what I was trying to say originally.

-- 
Petr Baudis
If you have good ideas, good data and fast computers,
you can do almost anything. -- Geoffrey Hinton


Re: UIMAj3 ideas

2015-07-16 Thread Petr Baudis
On Thu, Jul 16, 2015 at 08:00:35PM +0200, Richard Eckart de Castilho wrote:
> On 16.07.2015, at 18:52, Petr Baudis <pa...@ucw.cz> wrote:
> >   Sorry for the confusion, but that's not quite what I had in mind.
> > I literally believe that right now, in order to modify value of
> > a feature, you need to first remove it from an index, change the
> > value, then re-add it back.  Is that a misconception?
>
> Well, yes and no. Yes, it was required for the case where the value that
> you changed was on a feature that was part of some index. No, it should
> no longer be required as measures have been implemented to handle this
> automatically.
>
> See: The curious case of the zombie annotation aka UIMA-4049
>
> https://issues.apache.org/jira/browse/UIMA-4049

  That's great to hear!  However, when reading the bug report and
looking closely at that part of the release notes, I think "it should no
longer be required" isn't quite precise, as changing indexed features
might cause an exception to be thrown by an iterator that goes through
these at the same time.  (So the fix for that is to use a snapshot
iterator, and that sounds reasonable, more so when JCasUtil gets support
for them - sorry if it did and I missed it, I'm still stuck on UIMA 2.6
for now anyway until the next release with fixed CasCopier.)
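
  As an analogy only (this is plain java.util code, not UIMA, and the
names SortedIndexUpdateSketch and Ann are hypothetical), the same two
hazards show up with a TreeSet sorted on a mutable key: the element has
to leave the sorted structure before its key changes, and a loop that
edits keys should walk a snapshot rather than the live set.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.TreeSet;

class SortedIndexUpdateSketch {
    /** Hypothetical stand-in for an annotation whose begin offset is an index key. */
    static final class Ann {
        int begin;
        final String label;

        Ann(int begin, String label) {
            this.begin = begin;
            this.label = label;
        }
    }

    /** A sorted "index" ordered by begin offset, then by label. */
    static TreeSet<Ann> newIndex() {
        return new TreeSet<>(
                Comparator.comparingInt((Ann a) -> a.begin).thenComparing(a -> a.label));
    }

    /** Safe update: remove from the index, change the key, then re-add. */
    static void moveTo(TreeSet<Ann> index, Ann a, int newBegin) {
        boolean wasIndexed = index.remove(a);
        a.begin = newBegin;
        if (wasIndexed) {
            index.add(a);
        }
    }

    /** Iterates over a snapshot so the live index can be modified in the loop. */
    static void shiftAll(TreeSet<Ann> index, int delta) {
        for (Ann a : new ArrayList<>(index)) {
            moveTo(index, a, a.begin + delta);
        }
    }

    public static void main(String[] args) {
        TreeSet<Ann> index = newIndex();
        index.add(new Ann(0, "a"));
        index.add(new Ann(5, "b"));
        shiftAll(index, 10);
        System.out.println(index.first().begin + " " + index.last().begin);
    }
}
```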

> >   I think that's a bug for the UIMA Tutorial, which mentions FSArray but
> > not FSList.  :-)
>
> Then I should tell you also about the uimaFIT FSCollectionFactory which
> contains all kinds of helpers to manage FSArray and FSList ;)
>
> Btw. there is also ArrayFS which is the CAS version of FSArray :P
..
> Did you know that uimaFIT JCasUtil.select() can also be applied to
> FSList and FSArray to avoid casting?
>
> for (Token t : JCasUtil.select(sentence.getTokens(), Token.class)) {
>   ...
> }
>
> CasUtil.select() can work also on ArrayFS

  So much great news! Thanks so much for these.  We'll certainly start
using them in new code. :-)

-- 
Petr Baudis
If you have good ideas, good data and fast computers,
you can do almost anything. -- Geoffrey Hinton


Re: [ANN] Multi-threaded UIMA ASB

2015-07-09 Thread Petr Baudis
On Thu, Jul 09, 2015 at 04:17:44PM -0400, Marshall Schor wrote:
> Hi, just saw this ...
>
> I'll take a look.  This kind of thing is on the list for uima v3; see
> https://cwiki.apache.org/confluence/display/UIMA/Ideas+for+UIMAJ+v3

  Thanks, I was not aware of that page.

  However, it seems to concern a much harder case of annotators
working in parallel on the same CAS.  I'm solving an easy case where
each CAS is processed by just a single annotator at once.  For this,
there are thankfully no large changes in current UIMA needed,
apparently, if one accepts a few rough corners (as documented).

-- 
Petr Baudis
If you have good ideas, good data and fast computers,
you can do almost anything. -- Geoffrey Hinton


Re: Multi-threaded UIMA ParallelStep

2015-05-20 Thread Petr Baudis
  Hi!

On Wed, May 20, 2015 at 07:56:33AM -0400, Eddie Epstein wrote:
> Parallel-step currently only works with remote delegates. The other
> approach, using CasMultipliers, allows an arbitrary amount of parallel
> processing in-process. A CM would create a separate CAS for each delegate
> intended to run in parallel, and use a feature structure to hold a unique
> identifier in each child CAS which a custom flow controller would use to
> direct these CASes to the desired delegates. Results for the parallel flows
> could be merged in a CasConsumer back into the parent CAS or to some other
> output.

  Thanks for that hint.  However, I'm not sure how a flow controller
could direct CASes to delegates?  As far as I understand it, the flow
controller decides which AE processes the CAS next, but cannot control
the actual parallel execution of the flow, which would need to be taken
care of by the ASB (Analysis Structure Broker), and that would be the
thing to hack in this case.  Am I missing something?

  Thanks,

Petr Baudis


Multi-threaded UIMA ParallelStep

2015-05-19 Thread Petr Baudis
  Hi!

  I'm looking into ways to run a part of my pipeline multi-threaded:

                 .-> Multip0 -> A1 -> Multip1 -> A2 -.
  reader -> A0 -<                                     >- CASmerger
                 `-> Multip2 -> A3 ------------> A2 -'
                 ^^
                 ParallelStep is generated for each branch
                 in a custom flow controller

Basically, I need a way to tell UIMA to run each ParallelStep (which
normally just denotes the CAS flow) truly in parallel.  I have two
constraints:

  (i) I'm using UIMAfit heavily, and multiple CAS multipliers and
mergers (even within the parallel branches).  So I can't use CPE.

  (ii) I need multi-threading, not separate processes.  (I have just
a meager 24G RAM (sigh) and one Java process with all the linguistic
models and stuff loaded takes 3GB RAM.  So I really need to load these
resources to memory only once.)


  I looked into UIMA-AS, including Richard's helpful DKPro Lab code
sample, but I can't figure out how to make it work reasonably with
a *complex* UIMAfit pipeline that spans many branches and many
analysis engines - it seems to me that I would need some centralized
place to specify it, and basically completely rewrite my pipeline
building code (for the worse, in my impression).

  ...and I'm not even sure, from reading UIMA-AS code, if I could make
it run in multiple threads within a single process!  From comments in


org/apache/uima/aae/controller/AggregateAnalysisEngineController_impl.java:parallelStep()

I'm getting an impression that non-remote AEs will be executed serially
after all, not in parallel.  Is that correct?


  So going back to the original UIMA code, it seems to me that the thing
to do would be replacing ASB_impl with my own copy (inheritance would
not cut it the way it's coded), AggregateAnalysisEngine_impl with my own
specialization or copy (as ASB_impl usage is hardcoded there) and
rewrite the while() loop in ParallelStep case of ASB's
processUntilNextOutputCas() to run in parallel.  And hope I didn't miss
any catch...
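
  What "run the branches of a ParallelStep in parallel" would boil down
to can be sketched with plain java.util.concurrent.  This is not UIMA
code and not a patch for ASB_impl; ParallelBranchesSketch and runAll()
are hypothetical names, each branch is modelled as a function of the
input, and all threads live in one process so resources loaded once can
be shared read-only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelBranchesSketch {
    /** Runs every branch on the same input in its own thread; results keep branch order. */
    static <I, O> List<O> runAll(I input, List<Function<I, O>> branches) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, branches.size()));
        try {
            List<Callable<O>> tasks = new ArrayList<>();
            for (Function<I, O> branch : branches) {
                tasks.add(() -> branch.apply(input));
            }
            List<O> results = new ArrayList<>();
            // invokeAll blocks until all tasks finish and returns futures in task order
            for (Future<O> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for branches", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a branch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // A shared read-only resource stands in for models loaded once per process.
        Set<String> stopwords = Collections.singleton("the");
        List<Function<String, String>> branches = new ArrayList<>();
        branches.add(q -> "branch0:" + q.toUpperCase());
        branches.add(q -> "branch1:" + (stopwords.contains(q) ? "stop" : "keep"));
        System.out.println(runAll("the", branches));
    }
}
```

The sketch assumes every branch only reads its input; a real CAS is
mutable, so each branch would need its own CAS or the branches would
have to be known not to interfere.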


  Is there an option I'm missing?  Any hints would be really
appreciated!

  Thanks,

Petr Baudis


UIMAFit vs. LEO

2015-05-19 Thread Petr Baudis
  Hi!

On Thu, May 14, 2015 at 05:44:12PM +, Thomas Ginter wrote:
> There is also Leo which allows you to programmatically create pipelines,
> launch them as UIMA-AS services, and manage type systems and clients without
> having to touch any descriptor files.  You can find documentation at the site
> below:
>
> http://department-of-veterans-affairs.github.io/Leo/userguide.html

  I'm wondering how UIMAFit and LEO fit together.  My impression
right now is:

  * They both have the same goal.

  * Mixing them in the same pipeline might get messy(?)

  * LEO's advantage is that it seamlessly works with UIMA-AS (in fact it's
built around UIMA-AS).

  * UIMAFit's advantage is (if nothing else) a vastly wider ecosystem.

  Did I get this about right?

  Thanks,

Petr Baudis


Re: looking for lots of example UIMA code

2014-10-29 Thread Petr Baudis
On Wed, Oct 29, 2014 at 12:43:41PM -0400, Kameron Cole wrote:
> Thanks for the references.  As for the samples on the UIMA site, this is
> quite a find.  I have been on this site for 10 years now, and never really
> stumbled across it.  Just to be sure, this is where I am finding the most
> useful examples:
> http://svn.apache.org/viewvc/uima/uimaj/tags/uimaj-2.6.0/uimaj-examples/src/main/java/org/apache/uima/examples/
>
> Am I missing anything?

  Also make sure you look at uimafit.  It makes building pipelines so
much easier, and also has some examples.

> Thanks also to Sergey.  My main interest these days is inter-leaving UIMA
> code in the custom stages of IBM Watson Content Analytics.  That leaves the
> arduous work of making annotations to the wizard style development
> environment of WCA Studio, and the UIMA portion I use for call outs to
> other programs.

  If you are looking for an example of UIMA pipeline code rather than
UIMA annotator code, https://github.com/brmson/yodaqa has a moderately
interesting branched CAS pipeline.  I had quite a lot of trouble finding
other open source code examples that implement a non-linear pipeline.

Petr Baudis


Re: Restricting a aggregate engine to a substring or mention

2014-06-17 Thread Petr Baudis
On Tue, Jun 17, 2014 at 06:48:15PM +, Oliver Christ wrote:
> dkpro-core's BreakIteratorSegmenter (rather: its base class) takes the same
> approach. It allows you to specify that segmentation should occur within
> zones, defined by some other annotation type.

  And for most other dkpro-core's annotators adding other linguistic
features, it is thankfully typically fine to just prune the Sentence
annotations to the areas you want annotated.

  That's the approach I'm using when I first pre-filter a document for
interesting sentences, then copy just these over to another view and
run the taggers and parsers on just these.

Petr Pasky Baudis


Parallel Flow Controller?

2014-05-12 Thread Petr Baudis
  Hi!

  In my UIMA pipeline, at a few points I have a need for
some AEs to be executed logically in parallel - in particular,
I'd need this in case of a few CAS multipliers. If I understand
things correctly, there is no way with the fixed flow controller
to execute two CAS multipliers in parallel, i.e. both using
a single source CAS, dropping it and producing a bunch of new
CASes.  I need to create a CAS processing graph like:

                 .-> Multip0 -> A1 -> Multip1 -> A2 -.
  reader -> A0 -<                                     >- CASmerger
                 `-> Multip2 -> A3 ------------> A2 -'

  My current aim would be enclosing each of the branches (up to A2)
in an aggregate AE, and creating another aggregate AE that will
consist of these two branch AEs, governed by a custom parallel
flow controller that will ensure the input CAS is fed as input
to both branches and the union of output CAS of both branches
is sent out of the aggregate AE:

  Main: reader -> A0 -> AggregP -> A2 -> CASmerger
  AggregP: Aggreg0, Aggreg1 (ParallelFlowController)
  Aggreg0: Multip0 -> A1 -> Multip1
  Aggreg1: Multip2 -> A3

  I'd just like to confirm that no one has implemented a parallel
flow controller yet, and that I'm not missing a simple existing
solution to this problem.

  Kind regards,

Petr Pasky Baudis



Copying a CAS subset with offset correction

2014-04-27 Thread Petr Baudis
  Hi!

  I'm trying to figure out how to reliably do deep copies from one CAS
to another where the sofa of the target CAS is a substring of the sofa of
the source CAS.  E.g. copying, from the previous sentence, just "to do
deep copies from one CAS to another".

  One approach is to simply do something like

	// offset of the sub-span within the source sofa
	int ofs = subCasSpan.getBegin();
	CasCopier copier = new CasCopier(srcCas.getCas(), dstCas.getCas());
	for (Annotation a : JCasUtil.selectCovered(Annotation.class, subCasSpan)) {
		// deep-copy into the target CAS, then re-base the offsets
		Annotation a2 = (Annotation) copier.copyFs(a);
		a2.setBegin(a2.getBegin() - ofs);
		a2.setEnd(a2.getEnd() - ofs);
		a2.addToIndexes();
	}

However, the problem is when the feature structure contains references to
other feature structures; if these are outside the span, their offsets will
not get modified and these hidden feature structures will remain referenced
but become nonsensical and misleading, instead of ideally not being copied
at all and being replaced by null references.

  I don't think this is something that's easily achievable right now?
(The possible annotation types are an open set, manual per-annotation
handling of references is not feasible in my case.)

  I think the most reasonable solution would be to introduce a way to
specify an offset span for the CasCopier (or a subclass), with
annotations dropped if they are outside of the offset span?
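
  The behaviour I have in mind can be sketched in plain Java.  This is
only an illustration of the idea, not CasCopier code: Ann is a
hypothetical stand-in for an annotation with a single reference-valued
feature, and copyWithin() is a made-up name.  Annotations outside the
span are not copied, offsets are re-based, and references to uncopied
annotations become null.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

class SubSpanCopySketch {
    /** Hypothetical stand-in for an annotation with one reference-valued feature. */
    static final class Ann {
        final String label;
        final int begin;
        final int end;
        Ann ref; // may point at another annotation, or be null

        Ann(String label, int begin, int end) {
            this.label = label;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Copies only annotations inside [spanBegin, spanEnd], re-based to start at 0. */
    static List<Ann> copyWithin(List<Ann> source, int spanBegin, int spanEnd) {
        Map<Ann, Ann> copies = new IdentityHashMap<>();
        List<Ann> result = new ArrayList<>();
        // Pass 1: copy the covered annotations with shifted offsets.
        for (Ann a : source) {
            if (a.begin >= spanBegin && a.end <= spanEnd) {
                Ann copy = new Ann(a.label, a.begin - spanBegin, a.end - spanBegin);
                copies.put(a, copy);
                result.add(copy);
            }
        }
        // Pass 2: rewire references; targets that were not copied map to null.
        for (Map.Entry<Ann, Ann> e : copies.entrySet()) {
            Ann target = e.getKey().ref;
            e.getValue().ref = (target == null) ? null : copies.get(target);
        }
        return result;
    }

    public static void main(String[] args) {
        Ann outside = new Ann("Tok", 0, 5);
        Ann first = new Ann("Tok", 10, 15);
        Ann second = new Ann("Tok", 16, 20);
        first.ref = outside;  // points outside the span, will become null
        second.ref = first;   // points inside the span, will be preserved
        for (Ann a : copyWithin(Arrays.asList(outside, first, second), 10, 20)) {
            System.out.println(a.begin + "-" + a.end + " ref=" + (a.ref != null));
        }
    }
}
```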

  Thanks,

Petr Pasky Baudis


Re: Deduplicating Annotations With Same coveredText

2014-04-23 Thread Petr Baudis
  Hi!

On Tue, Apr 22, 2014 at 05:10:56PM -0400, Marshall Schor wrote:
> If you plan on running your pipeline in one JVM (rather than having it scaled
> out over multiple JVMs), you can consider using an external resource which
> would be a plain Java Set<String> of the unique covered text so far found.
> Then, in the annotator (or annotators) that are adding new FeatureStructures
> representing the possibly duplicate annotation, you can first check the
> shared resource to see if it's been already annotated, and if so, skip both
> creating the additional FeatureStructure, and adding it to the indexes.
>
> Would that work for your use case?

  That's an interesting approach, thanks for the suggestion.  While I
could do it this way now, I plan to scale out my setup to multiple
machines in the future and this solution would become inconvenient
then.  For the time being, I have simply loaded all the FSes to a
coveredText-addressed map and then removed duplicates.
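
  Roughly, the map-based approach looks like the following plain-Java
sketch.  It is not my actual code and uses no UIMA classes: Span is a
hypothetical stand-in for an annotation, and dedup() keeps the first
span seen for each distinct covered text, in document order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DedupByCoveredTextSketch {
    /** Hypothetical stand-in for an annotation over [begin, end). */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Keeps the first span for each distinct covered text, preserving input order. */
    static List<Span> dedup(String text, List<Span> spans) {
        Map<String, Span> firstByText = new LinkedHashMap<>();
        for (Span s : spans) {
            firstByText.putIfAbsent(text.substring(s.begin, s.end), s);
        }
        return new ArrayList<>(firstByText.values());
    }

    public static void main(String[] args) {
        String text = "the cat saw the dog";
        List<Span> spans = Arrays.asList(new Span(0, 3), new Span(4, 7), new Span(12, 15));
        for (Span s : dedup(text, spans)) {
            System.out.println(text.substring(s.begin, s.end));
        }
    }
}
```

In UIMA terms, the spans that do not survive would then be removed
from the indexes.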

Petr Pasky Baudis


Re: CAS Multiplier usage in UIMAfit

2014-04-16 Thread Petr Baudis
  Hi!

On Wed, Apr 16, 2014 at 03:26:54PM +0900, Hugo Mougard wrote:
> I'm trying to use a multiplier to discard some CASes based on some
> annotation. It currently doesn't work (the CASes are not discarded). I
> also noticed several tickets opened on the subject of multipliers and
> am therefore not sure if it's currently possible to use them in
> UIMAfit.

  Perhaps a better solution exists meanwhile, but some time ago,
Philipp W suggested on this mailing list a SimplePipeline replacement
that can deal with CAS multipliers:

https://groups.google.com/forum/#!topic/uimafit-users/yA0w2Q8tGNE

I had to wrap it up in an actual class and fix it for Aggregate Engines,
my version is at:


https://github.com/brmson/yodaqa/blob/master/src/main/java/cz/brmlab/yodaqa/flow/MultiCASPipeline.java

You just use it in the same way as you'd use SimplePipeline then, e.g.:


https://github.com/brmson/yodaqa/blob/9e12a80c/src/main/java/cz/brmlab/yodaqa/YodaQAApp.java


  P.S.: I think ideally, to enable better scale-out and for consistency
if you are using other Aggregate Engines anyway, you would probably
create a single aggregate engine for your pipeline with the proper flow
controller setup within, setting FlowController's ActionForIntermediateSegments
to drop.  In XML CPE descriptor you'd do that like this:


https://github.com/brmson/yodaqa/blob/bad64d5c/src/main/resources/cz/brmlab/yodaqa/pipeline/YodaQA.xml

If you come up with a way to do that in UIMAfit, I will be glad if you'd
share a working code snippet.

Petr Pasky Baudis