Re: Good practice for using and saving a resource built by previous annotators

Nicolas Hernandez Thu, 14 Nov 2013 12:25:03 -0800

Indeed, you can access a shared resource in the
collectionProcessComplete method. I wonder why I thought I could not.


To answer your questions: I was talking about shared resources, and I use
UIMA-AS to scale out.

Thanks Richard for your answer

On Thu, Nov 14, 2013 at 9:02 PM, Richard Eckart de Castilho
<r...@apache.org> wrote:
> On 14.11.2013, at 18:19, Nicolas Hernandez <nicolas.hernan...@gmail.com> wrote:
>
>> Dear All
>>
>> Let's say I want to count the occurrences of each word in a document
>> collection and then use these counters (possibly in the same workflow).
>> I have one CAS per document and I want to scale out the workflow.
>
> How do you scale it out?
>
>> To scale out the workflow, I use a resource to store the counters of
>> each word. The resource is written to by several instances of an
>> annotator, which process distinct CASes in parallel.
>
> What kind of resource do you use?
>
>> Here are my questions:
>> * I believe I cannot be sure that, by the time a later annotator in the
>> same workflow uses the resource, the resource is no longer being
>> modified (by counter annotators that are still processing the
>> remaining CASes). Right? In other words, I have no way to delay the
>> run of an annotator depending on the state of a resource?
>
> You can customize the flow by writing your own workflow controller.
> Whether that is supported depends on how you do your scaling.
>
>> * So, I may use two workflows: one to build the resource, the other
>> to use it. But how can I export/save the resource? I cannot access
>> the resource in the collectionProcessComplete method of an AE, can I?
>
> I would personally use the two workflows. Why do you believe that you cannot
> access the resource in collectionProcessComplete?
>
>> The solution I imagined was inspired by the use of a CAS multiplier
>> to merge CASes. It is to use two workflows, one of them dedicated to
>> building the resource. In this workflow, I define an annotator that is
>> not scaled out (so, a CAS consumer). In that annotator, I check the
>> SourceDocumentInformation Feature Structure in the CAS to see whether
>> its lastSegment feature is set to true; in that case I can export the
>> resource. I know this is not a guarantee that all CASes have been
>> processed. I may also have a special counter resource in that
>> annotator to count the processed CASes and export the desired
>> resource once all CASes have been processed. In that case, I would
>> need a way to communicate to the "exporter" annotator the number of
>> CASes which will be processed... This is not the main problem.
>>
>> After writing that, I realize that to do it in a single workflow, I
>> could have written a CAS multiplier that holds each CAS until all have
>> been processed, then creates as many new CASes as were saved...
>>
>> These solutions are very complex...
>>
>> Any suggestions... ? A uimaFIT trick =) ?
>
> Well, to do small-scale scaling using a CPE, I'd do this:
>
> - build an aggregate which generates the word counts
> - use a custom shared resource to do the counting
> - in collectionProcessComplete, call some synchronized "save" method on
>   the resource
> - if "save" is called a second time, it does nothing
>
> - build an aggregate which uses the word counts
>
> Run both workflows, one after the other using the CpePipeline of uimaFIT.
>
> -- Richard
>
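
The counting resource Richard describes above (concurrent increments, plus a
synchronized "save" that does nothing when called a second time) could be
sketched roughly as follows. This is only an illustration in plain Java with
no UIMA dependency: the class name WordCountResource and the methods
increment, get and save are hypothetical names of mine, and the tab-separated
output format is an assumption.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical counting resource shared by several annotator instances.
class WordCountResource {
    // Thread-safe counters: annotators running in parallel may increment concurrently.
    private final ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();
    // Guarded by the lock of save(): true once the counts have been written.
    private boolean saved = false;

    // Called by each annotator instance for every word it sees.
    public void increment(String word) {
        counts.computeIfAbsent(word, k -> new LongAdder()).increment();
    }

    // Current count for a word, 0 if the word was never seen.
    public long get(String word) {
        LongAdder adder = counts.get(word);
        return adder == null ? 0L : adder.sum();
    }

    // Meant to be called from collectionProcessComplete of every annotator
    // instance. Only the first call writes; later calls return false and do nothing.
    public synchronized boolean save(Writer out) throws IOException {
        if (saved) {
            return false;
        }
        // Sort by word so the output is deterministic (assumed format: word<TAB>count).
        for (Map.Entry<String, LongAdder> e : new TreeMap<String, LongAdder>(counts).entrySet()) {
            out.write(e.getKey() + "\t" + e.getValue().sum() + "\n");
        }
        out.flush();
        saved = true;
        return true;
    }
}
```

In an actual pipeline, I assume such a class would also implement UIMA's
SharedResourceObject interface and be bound to the annotators as an external
resource, with each annotator instance calling save from its
collectionProcessComplete method. The sketch does not by itself guarantee
that every instance has finished processing before the first save happens.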



-- 
Dr. Nicolas Hernandez
Associate Professor (Maître de Conférences)
Université de Nantes - LINA CNRS UMR 6241
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
+33 (0)2 51 12 53 94
+33 (0)2 40 30 60 67
