Re: Anybody using UIMA DUCC? Care to give a hand?

2022-11-11 Thread Eddie Epstein
Hi Richard,

Our last DUCC cluster was retired earlier this year. I would vote for
retirement.

Regards,
Eddie

On Fri, Nov 11, 2022 at 10:38 AM Richard Eckart de Castilho 
wrote:

> Hi all,
>
> is anybody using UIMA DUCC?
>
> If yes, it would be great if you could lend us a hand in preparing a new
> release.
>
> If nobody steps up, then I will suggest to retire UIMA DUCC towards the
> end of Nov 2022 (in about two weeks).
>
> Cheers,
>
> -- Richard (with the Apache UIMA PMC Chair hat on)
>
>


Re: Recover or invalidate Collection Reader CAS

2022-08-26 Thread Eddie Epstein
The JMS service descriptor defines timeout:
https://uima.apache.org/d/uima-as-v2-current/uima_async_scaleout.html#ugr.async.ov.concepts.jms_descriptor

Error configuration:
https://uima.apache.org/d/uima-as-v2-current/uima_async_scaleout.html#ugr.ref.async.deploy.descriptor.errorconfig

Eddie

On Fri, Aug 26, 2022 at 10:01 AM Daniel Cosio  wrote:

> Any chance you could point me to where this is defined in the docs?
> Daniel Cosio
> dcco...@gmail.com
>
>
>
> > On Aug 26, 2022, at 8:52 AM, Eddie Epstein  wrote:
> >
> > UIMA-AS supports timeouts for remote annotators; the default timeout is
> > infinite. On timeout uima-as will take the action specified by error
> > handling configuration, but in any case the CAS sent to the remote will
> be
> > available for reuse.
> >
> > Eddie
> >
> > On Thu, Aug 25, 2022 at 12:05 PM Timo Boehme 
> > wrote:
> >
> >> PS: if the OS is killing the process because of lack of memory (the typical
> >> case), it means the Java VM is allowed to use more (heap) memory than is
> >> available on that node. Maybe consider adjusting the memory setting for
> >> the Java process to prevent the OS kill. Then you may get an
> >> OutOfMemoryError, which is bad too, but the Java VM might be able to do
> >> some cleanup/shutdown etc.
> >>
> >>
> >> Timo Boehme
> >>
> >>
> >>
> >> Am 25.08.22 um 17:55 schrieb Timo Boehme:
> >>> Hi Daniel,
> >>>
> >>> I am not using UIMA-AS myself, but if the OS is killing a process because of
> >>> lack of resources it normally does so with a hard kill, which does not
> >>> allow the Java VM process to do any shutdown work.
> >>> One would need a separate process controlling the Java one that reacts if
> >>> the Java VM is killed - however, this won't help in getting the CAS
> >>> released (unless the controlling process has specific UIMA knowledge). I
> >>> don't know if UIMA-AS uses such a 2-process model per node, but I
> >>> assume it does not.
> >>>
> >>>
> >>> Regards,
> >>> Timo Boehme
> >>>
> >>>
> >>> Am 25.08.22 um 17:28 schrieb Daniel Cosio:
> >>>> Yes, this is uima-as-jms. The pipeline gets a signal from the OS and
> >>>> shuts down, but we lose the CAS. Is there any API I can use to tell
> >>>> the collection reader to invalidate it? I know the AE has
> >>>> a temp queue connection that communicates the CAS releases. I was
> >>>> wondering if there was any way of getting the temp queue connection and
> >>>> sending the message back to return the CAS, possibly in a shutdown
> >>>> hook.
> >>>>
> >>>>
> >>>> Daniel Cosio
> >>>> dcco...@gmail.com
> >>>>
> >>>>
> >>>>
> >>>>> On Aug 25, 2022, at 9:20 AM, Eddie Epstein 
> >> wrote:
> >>>>>
> >>>>> Daniel, is this again a uima-as deployment? If so, since the OS kills
> >>>>> processes, is it some remote AE being killed?
> >>>>>
> >>>>> Eddie
> >>>>>
> >>>>> On Wed, Aug 24, 2022 at 10:04 AM Daniel Cosio 
> >> wrote:
> >>>>>
> >>>>>> Hi, I have some instances where the OS has killed a pipeline to recover
> >>>>>> resources. When this happens the pipeline never returns the CAS to the
> >>>>>> reader, so the reader now has one less CAS
> >>>>>> available. Is there a way to either
> >>>>>> 1. Add a shutdown hook on the pipeline to return the CAS if it gets a
> >>>>>> shutdown signal
> >>>>>> or
> >>>>>> 2. Set an expiration on the collection reader to expire a CAS that
> >>>>>> is not
> >>>>>> returned and issue a new one into the CAS pool
> >>>>>>
> >>>>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>>
> >>>>>> Daniel Cosio
> >>>>>> dcco...@gmail.com
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >> --
> >> OntoChem GmbH
> >> Blücherstraße 24
> >> 06120 Halle (Saale)
> >> Germany
> >>
> >> email: timo.boe...@ontochem.com | web: www.ontochem.com
> >> HRB 215461 Amtsgericht Stendal  | USt-IdNr.: DE246232735
> >> managing directors: Dr. Lutz Weber (CEO), Dr. Felix Berthelmann (COO)
> >>
> >>
>
>


Re: Recover or invalidate Collection Reader CAS

2022-08-26 Thread Eddie Epstein
UIMA-AS supports timeouts for remote annotators; the default timeout is
infinite. On timeout uima-as will take the action specified by error
handling configuration, but in any case the CAS sent to the remote will be
available for reuse.
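
For illustration, a minimal sketch of the client-side analogue, assuming the standard UIMA-AS client API (the broker URL, queue name, and timeout values are placeholders); the per-delegate timeouts and error actions described above are configured in the service's deployment descriptor:

// Sketch: a UIMA-AS client with a finite process timeout so a CAS sent to a
// dead or hung remote comes back and is reusable. Constant names are taken
// from the UIMA-AS client API documentation; values are placeholders.
import java.util.HashMap;
import java.util.Map;
import org.apache.uima.aae.client.UimaAsynchronousEngine;
import org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngine_impl;
import org.apache.uima.cas.CAS;

public class TimeoutClientSketch {
  public static void main(String[] args) throws Exception {
    UimaAsynchronousEngine client = new BaseUIMAAsynchronousEngine_impl();

    Map<String, Object> appCtx = new HashMap<>();
    appCtx.put(UimaAsynchronousEngine.ServerUri, "tcp://localhost:61616"); // placeholder broker
    appCtx.put(UimaAsynchronousEngine.ENDPOINT, "MyServiceQueue");         // placeholder queue
    appCtx.put(UimaAsynchronousEngine.CasPoolSize, 2);
    appCtx.put(UimaAsynchronousEngine.Timeout, 60000);        // process CAS timeout (ms), not infinite
    appCtx.put(UimaAsynchronousEngine.GetMetaTimeout, 30000); // getMeta timeout (ms)
    appCtx.put(UimaAsynchronousEngine.CpcTimeout, 10000);     // collectionProcessComplete timeout (ms)

    client.initialize(appCtx);
    CAS cas = client.getCAS();
    cas.setDocumentText("some text");
    try {
      client.sendAndReceiveCAS(cas); // throws on timeout; the CAS is ours to release and reuse
    } finally {
      cas.release();
      client.stop();
    }
  }
}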

Eddie

On Thu, Aug 25, 2022 at 12:05 PM Timo Boehme 
wrote:

> PS: if the OS is killing the process because of lack of memory (the typical
> case), it means the Java VM is allowed to use more (heap) memory than is
> available on that node. Maybe consider adjusting the memory setting for
> the Java process to prevent the OS kill. Then you may get an
> OutOfMemoryError, which is bad too, but the Java VM might be able to do
> some cleanup/shutdown etc.
>
>
> Timo Boehme
>
>
>
> Am 25.08.22 um 17:55 schrieb Timo Boehme:
> > Hi Daniel,
> >
> > I am not using UIMA-AS myself, but if the OS is killing a process because of
> > lack of resources it normally does so with a hard kill, which does not
> > allow the Java VM process to do any shutdown work.
> > One would need a separate process controlling the Java one that reacts if
> > the Java VM is killed - however, this won't help in getting the CAS
> > released (unless the controlling process has specific UIMA knowledge). I
> > don't know if UIMA-AS uses such a 2-process model per node, but I
> > assume it does not.
> >
> >
> > Regards,
> > Timo Boehme
> >
> >
> > Am 25.08.22 um 17:28 schrieb Daniel Cosio:
> >> Yes, this is uima-as-jms. The pipeline gets a signal from the OS and
> >> shuts down, but we lose the CAS. Is there any API I can use to tell
> >> the collection reader to invalidate it? I know the AE has
> >> a temp queue connection that communicates the CAS releases. I was
> >> wondering if there was any way of getting the temp queue connection and
> >> sending the message back to return the CAS, possibly in a shutdown
> >> hook.
> >>
> >>
> >> Daniel Cosio
> >> dcco...@gmail.com
> >>
> >>
> >>
> >>> On Aug 25, 2022, at 9:20 AM, Eddie Epstein 
> wrote:
> >>>
> >>> Daniel, is this again a uima-as deployment? If so, since the OS kills
> >>> processes, is it some remote AE being killed?
> >>>
> >>> Eddie
> >>>
> >>> On Wed, Aug 24, 2022 at 10:04 AM Daniel Cosio 
> wrote:
> >>>
> >>>> Hi, I have some instances where the OS has killed a pipeline to recover
> >>>> resources. When this happens the pipeline never returns the CAS to the
> >>>> reader, so the reader now has one less CAS
> >>>> available. Is there a way to either
> >>>> 1. Add a shutdown hook on the pipeline to return the CAS if it gets a
> >>>> shutdown signal
> >>>> or
> >>>> 2. Set an expiration on the collection reader to expire a CAS that
> >>>> is not
> >>>> returned and issue a new one into the CAS pool
> >>>>
> >>>>
> >>>> Thanks
> >>>>
> >>>>
> >>>> Daniel Cosio
> >>>> dcco...@gmail.com
> >>>>
> >>>>
> >>>>
> >>>>
> >>
> >
> >
>
>
> --
> OntoChem GmbH
> Blücherstraße 24
> 06120 Halle (Saale)
> Germany
>
> email: timo.boe...@ontochem.com | web: www.ontochem.com
> HRB 215461 Amtsgericht Stendal  | USt-IdNr.: DE246232735
> managing directors: Dr. Lutz Weber (CEO), Dr. Felix Berthelmann (COO)
>
>


Re: Recover or invalidate Collection Reader CAS

2022-08-25 Thread Eddie Epstein
Daniel, is this again a uima-as deployment? If so, since the OS kills
processes, is it some remote AE being killed?

Eddie

On Wed, Aug 24, 2022 at 10:04 AM Daniel Cosio  wrote:

> Hi, I have some instances where the OS has killed a pipeline to recover
> resources. When this happens the pipeline never returns the CAS to the
> reader, so the reader now has one less CAS
> available. Is there a way to either
> 1. Add a shutdown hook on the pipeline to return the CAS if it gets a
> shutdown signal
> or
> 2. Set an expiration on the collection reader to expire a CAS that is not
> returned and issue a new one into the CAS pool
>
>
> Thanks
>
>
> Daniel Cosio
> dcco...@gmail.com
>
>
>
>


Re: Towards a (new) UIMA CAS JSON format - feedback welcome!

2021-08-26 Thread Eddie Epstein
Richard,
Looks promising! I put a few comments in the drive document.
Regards, Eddie

On Fri, Aug 20, 2021 at 5:27 AM Richard Eckart de Castilho 
wrote:

> Hi all,
>
> to facilitate working with UIMA CAS data and to promote interoperability
> between different UIMA implementations, a new UIMA JSON CAS format is in
> the works - you may already have noticed a corresponding issue in Jira
> as well as prototype pull requests in the Apache UIMA Java SDK as well
> as in DKPro Cassis. However, the work so far was only preliminary with
> the goal of creating a reasonably comprehensive specification draft
> that would be suitable for general comments.
>
> That draft is now available here and the comment functionality is enabled:
>
>
> https://docs.google.com/document/d/1tHQKbN4rPKOlkjQFGIEoIzI4ZBWRRzPBdOWMK6MIpV8/edit?usp=sharing
>
> If you think a JSON format for the UIMA CAS is a good idea, please have
> a look and provide any comments directly in the document or alternatively
> here to the list.
>
> The two prototype implementations can be found here if you would like
> to play around with them:
>
> * Apache UIMA Java SDK : https://github.com/apache/uima-uimaj/pull/137
> * DKPro Cassis (Python): https://github.com/dkpro/dkpro-cassis/pull/169
>
> Note that the prototype implementations largely but not fully follow the
> latest specification draft - this is all early work-in-progress.
>
> Looking forward to your comments!
>
> Best,
>
> -- Richard
>
> P.S.: this mail is cross-posted to the UIMA developers list. Please
>   send any replies to the users list.


Re: UIMA DUCC slow processing

2020-06-15 Thread Eddie Epstein
The time sequence of a DUCC job is as follows:
1. The JobDriver is started and the CR.init method called
2. When CR.init completes successfully one or more JobProcesses are started
and the Aggregate pipeline init method in each called
3. If the first pipeline init to complete is successful the DUCC job status
changes to RUNNING

The Processes tab on the job details page shows the init times for the JD
(JobDriver) and each of the JobProcesses. The ducc.log file on the Files
tab gives timestamps for job state changes.

Reported initialization times correspond to the init() method calls of the
UIMA components. Is the initialization delay in the CR init, or the
JobProcess init? Anything interesting in the logfiles for those components?

Normally the number of tasks should match the number of workitems. These
can be quite different if the JobProcess is using a custom UIMA-AS
asynchronous threading model. What do you see on the Work Items tab?

For debugging, DUCC's --all_in_one option allows running all the components,
CR + CM + AE + CC, in a single thread in the same process. I'd suggest that
for the CasConsumer issue. If that works, and if you are running multiple
pipelines, then there is likely a thread safety issue involving the
Elasticsearch API.

Eddie

On Mon, Jun 15, 2020 at 1:30 AM Dr. Raja M. Suleman <
raja.m.sulai...@gmail.com> wrote:

> Thank you very much for your response.
>
> Actually I am working on a project that would require horizontal scaling
> therefore I am focused on DUCC at the moment. My original query started
> with my question regarding a job I had created which was giving me a low
> throughput. The pipeline for this job looks like this:
>
>1.  A CollectionReader connects to an Elasticsearch server and reads ids
>from an index and adds *1* id in each workitem which is then passed to
>the CasMultipler.
>2. The CASMultiplier uses the 'id' in each workitem to get the 'document
>text' from the Elasticsearch index. Each document text is a short
> review (1
>- 20 lines) of English. In the Abstract 'next()' method I create an
> empty
>JCas object and add the document text and other details related to the
>review to the DocumentInfo(newcas) and return the JCas object.
>3. My AnalysisEngine is running sentiment analysis on the document text.
>sentiment analysis is a computationally expensive operation specially
> for
>longer reviews.
>4. Finally my CasConsumer is writing each DocumentInfo object into a
>Elasticsearch index.
>
>
> A few things I noticed running this jobs and would be grateful for your
> comments on them:
>
>1. The job's initialization time increases with the number of documents
>in the index exponentially. I'm using the Elasticsearch scroll API which
>returns all the document ids within milliseconds. However, the DUCC job
>takes a long time to start running (~35 minutes for 100k documents).
> I've
>noticed that the initialization time for the DUCC job increases
>exponentially with the number of records. Is this due to the new CASes
>being generated for each in CollectionReader.
>2.  While checking the Performance tab of a job in the webserver UI, I
>noticed that under the "Tasks" column, the number of Tasks for all the
>components except the AnalysisEngine (AE) is twice the number of
> documents
>processed, e.g. if the job has processed 100 documents, it will show 200
>tasks for all components and 100 for the AE component.
>3. In the CasConsumer, I tried to use the BulkProcessor provided by the
>Elasticsearch Java API, which works asynchronously to send bulk indexing
>requests. However, asynchronous calls weren't registering and the
>CasConsumer would return without writing anything in the Elasticsearch
>index. I checked the job logs and couldn't find any error messages.
>
> I'm sorry for another long message and I truly am grateful to you for your
> kind guidance.
>
> Thank you very much.
>
> On Mon, 15 Jun 2020, 00:34 Eddie Epstein,  wrote:
>
> > I forgot to add, if your application does not require horizontal scale
> out
> > to many CPUs on multiple machines, UIMA has a vertical scale out tool,
> the
> > CPE, that can support running multiple pipeline threads on a single
> > machine.
> > More information is at
> >
> >
> http://uima.apache.org/d/uimaj-current/tutorials_and_users_guides.html#ugr.tug.cpe
> >
> >
> >
> >
> > On Sun, Jun 14, 2020 at 7:06 PM Eddie Epstein 
> wrote:
> >
> > > In this case the problem is not DUCC, rather it is the high overhead of
> > > opening small files and sending them to a remote computer individually.

Re: UIMA DUCC slow processing

2020-06-14 Thread Eddie Epstein
I forgot to add, if your application does not require horizontal scale out
to many CPUs on multiple machines, UIMA has a vertical scale out tool, the
CPE, that can support running multiple pipeline threads on a single
machine.
More information is at
http://uima.apache.org/d/uimaj-current/tutorials_and_users_guides.html#ugr.tug.cpe
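
A minimal sketch of running a CPE programmatically from its descriptor (the descriptor path is a placeholder; the number of pipeline threads is normally set in the descriptor's casProcessors element via processingUnitThreadCount):

// Sketch: load a CPE descriptor and run it. "cpeDesc.xml" is a placeholder.
import org.apache.uima.UIMAFramework;
import org.apache.uima.collection.CollectionProcessingEngine;
import org.apache.uima.collection.metadata.CpeDescription;
import org.apache.uima.util.XMLInputSource;

public class RunCpeSketch {
  public static void main(String[] args) throws Exception {
    CpeDescription cpeDesc = UIMAFramework.getXMLParser()
        .parseCpeDescription(new XMLInputSource("cpeDesc.xml"));
    CollectionProcessingEngine cpe =
        UIMAFramework.produceCollectionProcessingEngine(cpeDesc);
    cpe.process(); // runs asynchronously; add a StatusCallbackListener to track completion
  }
}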




On Sun, Jun 14, 2020 at 7:06 PM Eddie Epstein  wrote:

> In this case the problem is not DUCC, rather it is the high overhead of
> opening small files and sending them to a remote computer individually. I/O
> works much more efficiently with larger blocks of data. Many small files
> can be merged into larger files using zip archives. DUCC sample code shows
> how to do this for CASes, and very similar code could be used for input
> documents as well.
>
> Implementing efficient scale out is highly dependent on good treatment of
> input and output data.
> Best,
> Eddie
>
>
> On Sat, Jun 13, 2020 at 6:24 AM Dr. Raja M. Suleman <
> raja.m.sulai...@gmail.com> wrote:
>
>> Hello,
>>
>> Thank you very much for your response and even more so for the detailed
>> explanation.
>>
>> So, if I understand it correctly, DUCC is more suited for scenarios where
>> we have large input documents rather than many small ones?
>>
>> Thank you once again.
>>
>> On Fri, 12 Jun 2020, 22:18 Eddie Epstein,  wrote:
>>
>> > Hi,
>> >
>> > In this simple scenario there is a CollectionReader running in a
>> JobDriver
>> > process, delivering 100K workitems to multiple remote JobProcesses. The
>> > processing time is essentially zero.  (30 * 60 seconds) / 100,000
>> workitems
>> > = 18 milliseconds per workitem. This time is roughly the expected
>> overhead
>> > of a DUCC jobDriver delivering workitems to remote JobProcesses and
>> > recording the results. DUCC jobs are much more efficient if the overhead
>> > per workitem is much smaller than the processing time.
>> >
>> > Typically DUCC jobs would be processing much larger blocks of content
>> per
>> > workitem. For example, if a workitem was a document, and the document
>> > parsed into the small CASes by the CasMultiplier, the throughput would
>> be
>> > much better. However, with this example, as the number of working
>> > JobProcess threads is scaled up, the CR (JobDriver) would become a
>> > bottleneck. That's why a typical DUCC Job will not send the Document
>> > content as a workitem, but rather send a reference to the workitem
>> content
>> > and have the CasMultipliers in the JobProcesses read the content
>> directly
>> > from the source.
>> >
>> > Even though content read by the JobProcesses is much more efficient, as
>> > scaleout continued to increase for this non-computation scenario the
>> > bottleneck would eventually move to the underlying filesystem or
>> whatever
>> > document source and JobProcess output are. The main motivation for DUCC
>> was
>> > jobs similar to those in the DUCC examples which use OpenNLP to process
>> > large documents. That is, jobs where CPU processing is the bottleneck
>> > rather than I/O.
>> >
>> > Hopefully this helps. If not, happy to continue the discussion.
>> > Eddie
>> >
>> > On Fri, Jun 12, 2020 at 1:16 PM Dr. Raja M. Suleman <
>> > raja.m.sulai...@gmail.com> wrote:
>> >
>> > > Hi,
>> > > Thank you for your reply and I'm sorry I couldn't get back to this
>> > > earlier.
>> > >
>> > > To get a better picture of the processing speed of DUCC, I made a
>> dummy
>> > > pipeline where the CollectionReader runs a for loop to generate 100k
>> > > workitems (so no disk reads). each workitem only has a simple string
>> in
>> > it.
>> > > These are then passed on to the CasMultiplier where for each workitem
>> I'm
>> > > creating a new CAS with DocumentInfo (again only having a simple
>> string
>> > > value) and pass it as a newcas to the CasConsumer. The CasConsumer
>> > doesn't
>> > > do anything except add the Document received in the CAS to the
>> logger. So
>> > > basically this pipeline isn't doing anything, no Input reads and the
>> only
>> > > output is the information added to the logger. Running this on the
>> > cluster
>> > > with 2 slave nodes with 8-CPUs and 32GB RAM each is still taking more
>> > than
>> > > 30 minutes. I don't understand

Re: UIMA DUCC slow processing

2020-06-14 Thread Eddie Epstein
In this case the problem is not DUCC, rather it is the high overhead of
opening small files and sending them to a remote computer individually. I/O
works much more efficiently with larger blocks of data. Many small files
can be merged into larger files using zip archives. DUCC sample code shows
how to do this for CASes, and very similar code could be used for input
documents as well.
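
A minimal sketch, for illustration, of reading many small documents out of a single zip archive instead of opening many individual files (the archive path is a placeholder; the DUCC sample code mentioned above does the analogous thing for CASes):

// Sketch: iterate the entries of one large archive; in a DUCC job this loop
// would typically live in the CasMultiplier running in the JobProcess.
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipDocumentReaderSketch {
  public static void main(String[] args) throws Exception {
    try (ZipFile zip = new ZipFile("documents.zip")) {           // placeholder archive
      Enumeration<? extends ZipEntry> entries = zip.entries();
      while (entries.hasMoreElements()) {
        ZipEntry entry = entries.nextElement();
        if (entry.isDirectory()) continue;
        try (InputStream in = zip.getInputStream(entry)) {
          String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
          // ... create a new CAS per document and set its text here
          System.out.println(entry.getName() + ": " + text.length() + " chars");
        }
      }
    }
  }
}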

Implementing efficient scale out is highly dependent on good treatment of
input and output data.
Best,
Eddie


On Sat, Jun 13, 2020 at 6:24 AM Dr. Raja M. Suleman <
raja.m.sulai...@gmail.com> wrote:

> Hello,
>
> Thank you very much for your response and even more so for the detailed
> explanation.
>
> So, if I understand it correctly, DUCC is more suited for scenarios where
> we have large input documents rather than many small ones?
>
> Thank you once again.
>
> On Fri, 12 Jun 2020, 22:18 Eddie Epstein,  wrote:
>
> > Hi,
> >
> > In this simple scenario there is a CollectionReader running in a
> JobDriver
> > process, delivering 100K workitems to multiple remote JobProcesses. The
> > processing time is essentially zero.  (30 * 60 seconds) / 100,000
> workitems
> > = 18 milliseconds per workitem. This time is roughly the expected
> overhead
> > of a DUCC jobDriver delivering workitems to remote JobProcesses and
> > recording the results. DUCC jobs are much more efficient if the overhead
> > per workitem is much smaller than the processing time.
> >
> > Typically DUCC jobs would be processing much larger blocks of content per
> > workitem. For example, if a workitem was a document, and the document
> > parsed into the small CASes by the CasMultiplier, the throughput would be
> > much better. However, with this example, as the number of working
> > JobProcess threads is scaled up, the CR (JobDriver) would become a
> > bottleneck. That's why a typical DUCC Job will not send the Document
> > content as a workitem, but rather send a reference to the workitem
> content
> > and have the CasMultipliers in the JobProcesses read the content directly
> > from the source.
> >
> > Even though content read by the JobProcesses is much more efficient, as
> > scaleout continued to increase for this non-computation scenario the
> > bottleneck would eventually move to the underlying filesystem or whatever
> > document source and JobProcess output are. The main motivation for DUCC
> was
> > jobs similar to those in the DUCC examples which use OpenNLP to process
> > large documents. That is, jobs where CPU processing is the bottleneck
> > rather than I/O.
> >
> > Hopefully this helps. If not, happy to continue the discussion.
> > Eddie
> >
> > On Fri, Jun 12, 2020 at 1:16 PM Dr. Raja M. Suleman <
> > raja.m.sulai...@gmail.com> wrote:
> >
> > > Hi,
> > > Thank you for your reply and I'm sorry I couldn't get back to this
> > > earlier.
> > >
> > > To get a better picture of the processing speed of DUCC, I made a dummy
> > > pipeline where the CollectionReader runs a for loop to generate 100k
> > > workitems (so no disk reads). each workitem only has a simple string in
> > it.
> > > These are then passed on to the CasMultiplier where for each workitem
> I'm
> > > creating a new CAS with DocumentInfo (again only having a simple string
> > > value) and pass it as a newcas to the CasConsumer. The CasConsumer
> > doesn't
> > > do anything except add the Document received in the CAS to the logger.
> So
> > > basically this pipeline isn't doing anything, no Input reads and the
> only
> > > output is the information added to the logger. Running this on the
> > cluster
> > > with 2 slave nodes with 8-CPUs and 32GB RAM each is still taking more
> > than
> > > 30 minutes. I don't understand how this is possible since there is no
> > > heavy
> > > I/O processing happening in the code.
> > >
> > > Any ideas please?
> > >
> > > Thank you.
> > >
> > > On 2020/05/18 12:47:41, Eddie Epstein  wrote:
> > > > Hi,
> > > >
> > > > Removing the AE from the pipeline was a good idea to help isolate the
> > > > bottleneck. The other two most likely possibilities are the
> collection
> > > > reader pulling from elastic search or the CAS consumer writing the
> > > > processing output.
> > > >
> > > > DUCC Jobs are a simple way to scale out compute bottlenecks across a
> > > > cluster. Scaleout may be of limited or no value for I/O bound jobs.

Re: UIMA DUCC slow processing

2020-06-12 Thread Eddie Epstein
Hi,

In this simple scenario there is a CollectionReader running in a JobDriver
process, delivering 100K workitems to multiple remote JobProcesses. The
processing time is essentially zero.  (30 * 60 seconds) / 100,000 workitems
= 18 milliseconds per workitem. This time is roughly the expected overhead
of a DUCC jobDriver delivering workitems to remote JobProcesses and
recording the results. DUCC jobs are much more efficient if the overhead
per workitem is much smaller than the processing time.

Typically DUCC jobs would be processing much larger blocks of content per
workitem. For example, if a workitem was a document, and the document
parsed into the small CASes by the CasMultiplier, the throughput would be
much better. However, with this example, as the number of working
JobProcess threads is scaled up, the CR (JobDriver) would become a
bottleneck. That's why a typical DUCC Job will not send the Document
content as a workitem, but rather send a reference to the workitem content
and have the CasMultipliers in the JobProcesses read the content directly
from the source.

Even though content read by the JobProcesses is much more efficient, as
scaleout continued to increase for this non-computation scenario the
bottleneck would eventually move to the underlying filesystem or whatever
document source and JobProcess output are. The main motivation for DUCC was
jobs similar to those in the DUCC examples which use OpenNLP to process
large documents. That is, jobs where CPU processing is the bottleneck
rather than I/O.

Hopefully this helps. If not, happy to continue the discussion.
Eddie

On Fri, Jun 12, 2020 at 1:16 PM Dr. Raja M. Suleman <
raja.m.sulai...@gmail.com> wrote:

> Hi,
> Thank you for your reply and I'm sorry I couldn't get back to this
> earlier.
>
> To get a better picture of the processing speed of DUCC, I made a dummy
> pipeline where the CollectionReader runs a for loop to generate 100k
> workitems (so no disk reads). each workitem only has a simple string in it.
> These are then passed on to the CasMultiplier where for each workitem I'm
> creating a new CAS with DocumentInfo (again only having a simple string
> value) and pass it as a newcas to the CasConsumer. The CasConsumer doesn't
> do anything except add the Document received in the CAS to the logger. So
> basically this pipeline isn't doing anything, no Input reads and the only
> output is the information added to the logger. Running this on the cluster
> with 2 slave nodes with 8-CPUs and 32GB RAM each is still taking more than
> 30 minutes. I don't understand how this is possible since there is no heavy
> I/O processing happening in the code.
>
> Any ideas please?
>
> Thank you.
>
> On 2020/05/18 12:47:41, Eddie Epstein  wrote:
> > Hi,
> >
> > Removing the AE from the pipeline was a good idea to help isolate the
> > bottleneck. The other two most likely possibilities are the collection
> > reader pulling from elastic search or the CAS consumer writing the
> > processing output.
> >
> > DUCC Jobs are a simple way to scale out compute bottlenecks across a
> > cluster. Scaleout may be of limited or no value for I/O bound jobs.
> > Please give a more complete picture of the processing scenario on DUCC.
> >
> > Regards,
> > Eddie
> >
> >
> > On Sat, May 16, 2020 at 1:29 AM Raja Muhammad Suleman <
> > sulem...@edgehill.ac.uk> wrote:
> >
> > > Hi,
> > > I've been trying to run a very small UIMA DUCC cluster with 2 slave
> nodes
> > > having 32GB of RAM each. I wrote a custom Collection Reader to read
> data
> > > from an Elasticsearch index and dump it into a new index after certain
> > > analysis engine processing. The Analysis Engine is a simple sentiment
> > > analysis code. The performance I'm getting is very slow as it is only
> able
> > > to process ~150 documents/minute.
> > > To test the performance without the analysis engine, I removed the AE
> from
> > > the pipeline but still I did not get any improvement in the processing
> > > speeds. Can you please guide me as to where I might be going wrong or
> what
> > > I can do to improve the processing speeds?
> > >
> > > Thank you.
> > >
> >
>


Re: UIMA DUCC slow processing

2020-05-18 Thread Eddie Epstein
Hi,

Removing the AE from the pipeline was a good idea to help isolate the
bottleneck. The other two most likely possibilities are the collection
reader pulling from elastic search or the CAS consumer writing the
processing output.

DUCC Jobs are a simple way to scale out compute bottlenecks across a
cluster. Scaleout may be of limited or no value for I/O bound jobs.
Please give a more complete picture of the processing scenario on DUCC.

Regards,
Eddie


On Sat, May 16, 2020 at 1:29 AM Raja Muhammad Suleman <
sulem...@edgehill.ac.uk> wrote:

> Hi,
> I've been trying to run a very small UIMA DUCC cluster with 2 slave nodes
> having 32GB of RAM each. I wrote a custom Collection Reader to read data
> from an Elasticsearch index and dump it into a new index after certain
> analysis engine processing. The Analysis Engine is a simple sentiment
> analysis code. The performance I'm getting is very slow as it is only able
> to process ~150 documents/minute.
> To test the performance without the analysis engine, I removed the AE from
> the pipeline but still I did not get any improvement in the processing
> speeds. Can you please guide me as to where I might be going wrong or what
> I can do to improve the processing speeds?
>
> Thank you.
>


Re: Use of CASes with sofaURI?

2019-10-25 Thread Eddie Epstein
Besides very large documents and remote data, another major motivation was
for non-text data, such as audio or video.
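
A minimal sketch of that pattern for a non-text artifact, using only the standard CAS API (the file path and MIME type are placeholders):

// Sketch: point a CAS at non-text sofa data via a URI and read it back as a stream.
import java.io.InputStream;
import org.apache.uima.cas.CAS;
import org.apache.uima.resource.metadata.TypeSystemDescription;
import org.apache.uima.util.CasCreationUtils;

public class AudioSofaSketch {
  public static void main(String[] args) throws Exception {
    CAS cas = CasCreationUtils.createCas((TypeSystemDescription) null, null, null);
    cas.setSofaDataURI("file:/path/to/recording.wav", "audio/wav"); // placeholders
    try (InputStream in = cas.getSofaDataStream()) {
      byte[] header = in.readNBytes(44); // e.g. peek at a WAV header
      System.out.println("read " + header.length + " bytes of sofa data");
    }
  }
}
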
Eddie

On Fri, Oct 25, 2019 at 1:33 PM Marshall Schor  wrote:

> Hi,
>
> Here's what I vaguely remember was the driving use-cases for the sofa as a
> URI.
>
> 1.  The main use case was for applications where the data was so large, it
> would
> be unreasonable to read it all in and save as a string.
>
> 2.  The prohibition on changing a sofa spec (without resetting the CAS)
> was that
> it has the potential for users to invalidate the results, in this
> (imagined)
> scenario:
>
> a) User creates cas with some sofa data,
> b) User runs annotators, which create annotations that "point into"
> the sofa
> data
> c) User changes the sofa spec, to different data, but now all the
> annotations still are pointing into "offsets" in the original data.
>
> You can change the sofa data setting, but only after resetting the CAS.
>
> Did you have a use case for wanting to change the sofa data without
> resetting the CAS?
>
>
> It sounds like you have another interesting use case:
>
> a) want to convert the sofa data uri -> a string and have the normal
> getDocumentText etc. work, but
> b) have the serialization serialize the sofaURI, and not the data
> that's
> present there.
>
> This might be a nice convenience.
>
> I can see a couple of issues:
>   a) it might need to have a good strategy for handling very large data.
> E.g.,
> the convert method might need to include a max string size spec.
>   b) Since the serialization would serialize the annotations, but not the
> data
> (it would only serialize the URI), the data at that URI could easily
> change,
> making the annotation results meaningless.  Perhaps some "fingerprinting"
> (developing a checksum of the data, and serializing that to be able to
> signal if
> that did happen) would be a reasonable protection.
>
> Maybe do a new feature-request issue?
>
> -Marshall
>
> I imagine the JavaDoc for this method would say something like: it has the
> potential to exceed your memory, at run time, due to the potential size of
> the
> data...
>
>
> On 10/25/2019 12:59 PM, Richard Eckart de Castilho wrote:
> > Hi,
> >
> > On 25. Oct 2019, at 17:53, Marshall Schor  wrote:
> >> One other useful sources for examples:  The test cases for UIMA, e.g.
> search the
> >> uimaj-core projects *.java files for "getSofaDataStream".
> > Ok, let me elaborate :)
> >
> > One can use setSofaDataURI(url) to tell the CAS that the sofa data is
> actually external.
> > One can then use getSofaDataStream() resolve the URL and retrieve the
> data as a stream.
> >
> > So let's assume I have a CAS containing annotations on a text and the
> text is in an external file:
> >
> >   CAS cas = CasCreationUtils.createCas((TypeSystemDescription) null,
> null, null);
> >   cas.setSofaDataURI("file:/path/to/my/file", "text/plain");
> >
> > Works nice when I use getSofaDataStream() to retrieve the data.
> >
> > But I can't use the "normal" methods like getDocumentText() or
> getCoveredText() at all.
> >
> > Also, I cannot call setSofaDataString(urlContent, "text/plain") - it
> throws an exception
> > because there is already a sofaURI set. This is a major inconvenience.
> >
> > The ClearTK guys came up with an approach that tries to make this a bit
> more convenient:
> >
> > * they introduce a well-known view named "UriView" and set the
> sofaDataURI in that view.
> > * then they use a special reader which looks up the URI in that view,
> resolves it and
> >   drops the content into the sofaDataString of the "_defaultView".
> >
> > That way they get the benefit of the externally stored sofa as well as
> the ability to use
> > the usual methods to access the text.
> >
> > When I looked at setSofaDataURI(), I naively expected that it would be
> resolved the first
> > time I try to access the sofa data (e.g. via getDocumentText()) - but
> that doesn't happen.
> >
> > Then I expected that I would just call getSofaDataStream() and manually
> drop the contents
> > into setSofaDataString() and that this data string would be "transient",
> i.e. not saved
> > into XMI because we already have a setSofaDataURI set... but that
> expectation was also
> > not fulfilled.
> >
> > Could it be useful to introduce some place where we can transiently drop
> data obtained
> > from the sofaDataURI such that methods like getDocumentText() and
> getCoveredText() do
> > something useful but also such that the data is not included when
> serializing the CAS to
> > whatever format?
> >
> > Cheers,
> >
> > -- Richard
>


Re: DUCC without shared file system

2019-09-05 Thread Eddie Epstein
Unless all CLI/API submissions are done from the head node, DUCC still has
a dependency on a shared filesystem to authenticate such requests for
configurations where user processes run with user credentials.

On Wed, Sep 4, 2019 at 9:41 AM Lou DeGenaro  wrote:

> The DUCC Book for the Apache-UIMA DUCC demo
> http://uima-ducc-demo.apache.org:42133/doc/duccbook.html has been updated
> with respect to Jira 6121.  In particular, see section 12.9 for an example
> use of ducc_rsync to install DUCC on an additional worker node when not
> using a shared filesystem for $DUCC_HOME.
>
> On Tue, Sep 3, 2019 at 5:06 PM Lou DeGenaro 
> wrote:
>
> > I opened Jira https://issues.apache.org/jira/browse/UIMA-6121 to track
> > this issue.
> >
> >
> > On Tue, Sep 3, 2019 at 1:51 PM Lou DeGenaro 
> > wrote:
> >
> >> You do not need to do anything special to run DUCC without a shared
> >> filesystem.  Simply install it on a local filesystem.  However, there is
> >> one caveat.  If the user's (e.g. DUCC jobs) log directory is not in a
> >> shared filesystem, then DUCC-Mon will not have access and the contents
> >> won't be viewable. I'll open a Jira to review the DUCC Book and
> fix/clarify
> >> shared file system requirements.
> >>
> >> Lou.
> >>
> >> On Tue, Sep 3, 2019 at 11:58 AM Wahed Hemati <
> hem...@em.uni-frankfurt.de>
> >> wrote:
> >>
> >>> Hi there,
> >>>
> >>> the release notes of DUCC 3.0.0 indicates, that one major change is,
> >>> that DUCC can now run without shared file system.
> >>>
> >>> How do I set this up? In the Duccbook however it says that you need a
> >>> shared filesystem to add more nodes
> >>> (https://uima.apache.org/d/uima-ducc-3.0.0/duccbook.html#x1-22400012.9
> ).
> >>>
> >>> Thanks in advance.
> >>>
> >>> -Wahed
> >>>
> >>>
>


Re: Customizing Sample Pinger of Uima

2019-05-10 Thread Eddie Epstein
Hi Florian,

The documentation for this is at
http://uima.apache.org/d/uima-as-2.10.3/uima_async_scaleout.html#ugr.ref.async.api.usage_targetservice

There is a test case at
https://svn.apache.org/viewvc/uima/uima-as/trunk/uimaj-as-activemq/src/test/java/org/apache/uima/ee/test/TestUimaASExtended.java?revision=1826882&view=markup#l2238

Regards,
Eddie

On Mon, May 6, 2019 at 10:13 AM Florian  wrote:

> Hi,
>
> is there perhaps an example or demo for sending requests to individual
> service instances? I couldn't find an example in the repository.
>
> Best regards,
> Florian
>
> On Mi, Mai 1, 2019 at 12:13 AM, Eddie Epstein 
> wrote:
> > Hi Florian,
> >
> > Interesting questions. First, yes the intended behavior is to leave 1
> > instance running. Services are either started by having
> > autostart=true, or
> > by a job or another service having a dependency on the service.
> > Logically
> > it could be possible to let a pinger stop all instances and have the
> > service still be in some kind of "running" state so that the pinger
> > would
> > continue running and be able to restart instances when it detected a
> > need;
> > all that is needed is a bit of programming :)
> >
> > A hacky approach would be not to use autostart, rather to start
> > service-A
> > by using a dummy service-B with a dependency on A. When service A
> > pinger
> > wants to stop A, it could issue a command to stop B which would allow
> > service A to be stopped. Restarting A would require an external
> > program
> > requesting B to be started again.
> >
> > For the second question, the answer is yes for UIMA-AS services. The
> > latest
> > version of UIMA-AS supports sending process requests to specific
> > service
> > instances. A pinger could send such requests, and when an instance
> > fails to
> > reply the pinger can direct that instance to be stopped and another
> > instance started. The answer is also yes for custom services for
> > which the
> > pinger knows how to address each instance.
> >
> > Regards,
> > Eddie
> >
> > On Tue, Apr 30, 2019 at 1:43 PM Florian  wrote:
> >
> >>  Hello everyone,
> >>
> >>  I have two questions about the given sample pinger example of Uima.
> >>
> >>  Is it possible to set the minimal number of instances of a service
> >> to
> >>  zero? If I set the min-variable to zero, UIMA always starts a
> >> new
> >>  instance when the last one is shut down. Is this behavior intended,
> >> or
> >>  is there a way to prevent the start of a new instance when there
> >> are no
> >>  calls to the service? As we have some services that are rarely used,
> >> we
> >>  would only like to start instances on demand.
> >>
> >>  Secondly, is there also an option to call specific instances of a
> >> service
> >>  and restart them? We would like to do health checks for individual
> >>  instances and restart them if needed.
> >>
> >>  Best Regards
> >>
> >>  Florian
> >>
> >>
> >>
> >>
>
>


Re: Customizing Sample Pinger of Uima

2019-04-30 Thread Eddie Epstein
Hi Florian,

Interesting questions. First, yes the intended behavior is to leave 1
instance running. Services are either started by having autostart=true, or
by a job or another service having a dependency on the service. Logically
it could be possible to let a pinger stop all instances and have the
service still be in some kind of "running" state so that the pinger would
continue running and be able to restart instances when it detected a need;
all that is needed is a bit of programming :)

A hacky approach would be not to use autostart, rather to start service-A
by using a dummy service-B with a dependency on A. When service A pinger
wants to stop A, it could issue a command to stop B which would allow
service A to be stopped. Restarting A would require an external program
requesting B to be started again.

For the second question, the answer is yes for UIMA-AS services. The latest
version of UIMA-AS supports sending process requests to specific service
instances. A pinger could send such requests, and when an instance fails to
reply the pinger can direct that instance to be stopped and another
instance started. The answer is also yes for custom services for which the
pinger knows how to address each instance.

Regards,
Eddie

On Tue, Apr 30, 2019 at 1:43 PM Florian  wrote:

> Hello everyone,
>
> I have two questions about the given sample pinger example of Uima.
>
> Is it possible to set the minimal number of instances of a service to
> zero? If I set the min-variable to zero, UIMA always starts a new
> instance when the last one is shut down. Is this behavior intended, or
> is there a way to prevent the start of a new instance when there are no
> calls to the service? As we have some services that are rarely used, we
> would only like to start instances on demand.
>
> Secondly, is there also an option to call specific instances of a service
> and restart them? We would like to do health checks for individual
> instances and restart them if needed.
>
> Best Regards
>
> Florian
>
>
>
>


Re: DUCC Job does not work on any other language except English

2018-08-04 Thread Eddie Epstein
Hi Rohit,

Hopefully this is something fairly easy to fix. Thanks for the information.

Eddie

On Thu, Aug 2, 2018 at 2:46 AM, Rohit Yadav  wrote:

> Hi,
>
> I've tried running DUCC Job for various languages but all the content is
> replaced by (Question Mark)
>
> But for English it works fine. I was wondering whether this is a problem in
> configuration of DUCC.
>
> Any idea about this?
>
> Best,
>
> Rohit
>
>


Re: Restrict resource of a DUCC node

2018-07-25 Thread Eddie Epstein
Hi Erik,

Your user ID has hit the limit for "max user processes" on the machine.
Note that processes and threads are the same in Linux, and a single JVM may
spawn many threads (for example many GC threads :)  This parameter used to
be unlimited for users, but there was a change in Red Hat distros to limit
users to 1024 or so a few years ago. The system admin will have to raise
the limit for users. On Red Hat the configuration needs to be in
/etc/security/limits.d/90-nproc.conf for RHEL v7.x.

Eddie


On Wed, Jul 25, 2018 at 9:23 AM, Erik Fäßler 
wrote:

> Hi all,
>
> is there a way to tell DUCC how much resources of a node it might allocate
> to jobs? My issue is that when I let DUCC scale out my jobs with an
> appropriate memory definition via process_memory_size, I get a lot of “Init
> Fails” where each failed process log shows
>
> #
> # There is insufficient memory for the Java Runtime Environment to
> continue.
> # Cannot create GC thread. Out of system resources.
>
>
>
>
> If I raise the memory requirement per job to like 40GB (which they never
> require), this issue does not come up because only a few processes get
> started, but then most CPU cores are left unused, wasting time.
>
> I can’t use the machines exclusively for DUCC, so can I tell DUCC somehow
> how many resources (CPUs, memory) it may allocate?
>
> Thanks,
>
> Erik


Re: DUCC services statistics

2018-07-19 Thread Eddie Epstein
Hi,

As you may see, the default DUCC pinger for UIMA-AS services scrapes JMX
stats from the service broker to report the number of producers and
consumers, the queue depth and a few other things. This pinger also does
stat reset operations on each call to the pinger, I think every 2 minutes.
A custom pinger can easily be created that does not do the reset, or even
the getMeta call if desired. The string that pingers return to DUCC are
displayed as hover text over the entries in the "State" column, for example
over "Available".

Eddie



On Thu, Jul 19, 2018 at 7:31 AM, Daniel Baumartz <
bauma...@stud.uni-frankfurt.de> wrote:

> Hi,
>
> we have a DUCC installation with different services that are being used by
> jobs and external clients. We want to collect some statistics for a
> monitoring tool that are not included in the DUCC dashboard, e.g. how often
> and when a service/queue has been used, which services are used by which
> client at the moment...
>
> It looks like I could get this information by monitoring the ActiveMQ
> messages with an external program, or by using a custom Pinger? What would
> be the best way to handle this?
>
> Thanks,
> Daniel
>


Re: High CPU Load on Job Driver

2018-07-19 Thread Eddie Epstein
Hi Rohit,

What is the collection reader running in the job driver doing? Look at the
memory use (RSS) value for the job driver on the job details page. If
nothing is logged (be sure to check ducc.log file) my guess would be that
the JD ran out of RAM in its cgroup and was killed. The JD cgroup size can
be increased dynamically for future jobs without DUCC restart by editing
ducc.properties (ducc.jd.share.quantum) The default Xmx for JD is specified
by ducc.driver.jvm.args.

Eddie

On Wed, Jul 18, 2018 at 9:12 AM, Rohit yadav  wrote:

> Hi,
>
> I am running a job on DUCC with a CR, AE and CC. But while running, the Job
> Driver's CPU consumption increases to 600%.
> I wanted to know if it is normal to have such high CPU consumption in the Job Driver.
> Also, after running for 1-2 hours DUCC stops.
> And after I restart DUCC and check the logs of the JD, there is nothing
> mentioned about why DUCC stopped.
> Also, my DUCC is running on 3 nodes.
> The JD is configured at the head node.
> --
> Best,
> *Rohit Yadav*
>


Re: run existing AE instance on different view

2018-07-10 Thread Eddie Epstein
I think the UIMA code uses the annotator context to map the _InitialView,
and the context remains static for the life of the annotator. Replicating
annotators to handle different views has been used here too, but I agree it
is ugly.

If the annotator code can be changed, then one approach would be to put
some information in a fixed _InitialView that specifies which named view(s)
should be analyzed and have all downstream annotators use that to select
the view(s) to operate on.

Also, it sounds possible to have a single new component use the CasCopier to
create a new view that is always the one processed.
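
A minimal sketch of that last idea, assuming CasCopier may copy between views of the same CAS; the view names are hypothetical:

// Sketch: copy the view to analyze into a fixed view that the rest of the
// pipeline always reads. View names are hypothetical; this assumes CasCopier
// supports same-CAS view copying (source and target CAS are the same object).
import org.apache.uima.analysis_component.CasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CAS;
import org.apache.uima.util.CasCopier;

public class ViewNormalizer extends CasAnnotator_ImplBase {
  @Override
  public void process(CAS cas) throws AnalysisEngineProcessException {
    CAS source = cas.getView("secondaryView");     // hypothetical view to analyze
    CAS target = cas.createView("viewToProcess");  // fixed view the pipeline reads; fails if it exists
    CasCopier copier = new CasCopier(cas, cas);
    copier.copyCasView(source, target, true);      // true: also copy the sofa
  }
}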

Regards,
Eddie

On Thu, Jul 5, 2018 at 8:52 AM, Jens Grivolla  wrote:

> Hi,
>
> I'm trying to run an already instantiated AE on a view other than
> _InitialView. Unfortunately, I can't just call process() on the desired
> view, as there is a call to Util.getStartingView(...)
> in PrimitiveAnalysisEngine_impl that forces it back to _InitialView.
>
> The view mapping methods I found (e.g. using an AggregateBuilder) work on
> AE descriptions, so I would need to create additional instances (with the
> corresponding memory overhead). Is there a way to remap/rename the views in
> a JCas before calling process() so that the desired view is seen as the
> _InitialView? It looks like CasCopier.copyCasView(..) could maybe be used
> for this, but it doesn't feel quite right.
>
> Best,
> Jens
>


Re: Problem in running DUCC Job for Arabic Language

2018-07-05 Thread Eddie Epstein
So if you run the AE as a DUCC UIMA-AS service and send it CASes from some
UIMA-AS client it works OK? The full environment for all processes that
DUCC launches are available via ducc-mon under the Specification or
Registry tab for that job or managed reservation or service. Please see if
the LANG setting for the service is different from the LANG setting for the
job.

One can also see the LANG setting for a Linux process id by doing:

cat /proc/<pid>/environ

The LANG to be used for a DUCC process can be set by adding to the
--environment argument "LANG=xxx" as needed

Thanks,
Eddie



On Thu, Jul 5, 2018 at 6:47 AM, rohit14csu...@ncuindia.edu <
rohit14csu...@ncuindia.edu> wrote:

> Hey,
>  Yeah, you got it right: the first snippet is in the CR before the data goes
> into the CAS.
> And the second snippet is in the first annotator or analysis engine (AE) of
> my Aggregate Descriptor.
> I am pretty sure this is an issue with the CAS used by DUCC because if I use
> a DUCC service, in which we are supposed to send the CAS and receive the
> same CAS with added features from DUCC, I get accurate results.
>
> But the problem only comes up when submitting a job where the CAS is generated
> by DUCC.
> This can also be an issue with the environment (language) of DUCC because the
> default language is English.
>
> Best Regards
> Rohit
>
> On 2018/07/03 13:11:50, Eddie Epstein  wrote:
> > Rohit,
> >
> > Before sending the data into jcas if i force encode it :-
> > >
> > > String content2 = null;
> > > content2 = new String(content.getBytes("UTF-8"), "ISO-8859-1");
> > > jcas.setDocumentText(content2);
> > >
> >
> > Where is this code, in the job CR?
> >
> >
> >
> > >
> > > And when i go in my first annotator i force decode it:-
> > >
> > > String content = null;
> > > content = new String(jcas.getDocumentText().getBytes("ISO-8859-1"),
> > > "UTF-8");
> > >
> >
> > And is this in the first annotator of the job process, i.e. the CM?
> >
> > Please be as specific as possible.
> >
> > Thanks,
> > Eddie
> >
>


Re: Problem in running DUCC Job for Arabic Language

2018-07-03 Thread Eddie Epstein
Rohit,

Before sending the data into jcas if i force encode it :-
>
> String content2 = null;
> content2 = new String(content.getBytes("UTF-8"), "ISO-8859-1");
> jcas.setDocumentText(content2);
>

Where is this code, in the job CR?



>
> And when i go in my first annotator i force decode it:-
>
> String content = null;
> content = new String(jcas.getDocumentText().getBytes("ISO-8859-1"),
> "UTF-8");
>

And is this in the first annotator of the job process, i.e. the CM?

Please be as specific as possible.

Thanks,
Eddie


Re: Problem in running DUCC Job for Arabic Language

2018-06-18 Thread Eddie Epstein
Hi Rohit,

In a DUCC job the CAS created by the user's CR in the Job Driver is serialized
into cas.xmi format and transported to the Job Process, where it is
deserialized and given to the user's analytics. Likely the problem is in CAS
serialization or deserialization, perhaps due to the active LANG
environment on the JD or JP machines?

Eddie

On Thu, Jun 14, 2018 at 1:48 AM, Rohit yadav  wrote:

> Hey,
>
> I use DUCC for the English language and it works without any problem.
> But lately I tried deploying a job for the Arabic language, and all the
> Arabic text content is replaced by *'?'* (question mark).
>
> I am extracting data from Accumulo and after processing I send it to ES6.
>
> When I checked the log files of the JD, they show that the Arabic data is
> coming into the CR without any problem.
> But when I check another log file, it shows that the moment the data enters
> my AE the Arabic content is replaced by question marks.
> Please find the log files attached with this mail.
>
> I think this may be a problem with the CM because the data is fine inside
> the CR, and the most interesting part is that if I try running the same
> pipeline through the CPM it works without any problem, which means DUCC is
> facing some issue.
>
> I'll look forward to your reply.
>
> --
> Best Regards,
> *Rohit Yadav*
>


Re: [External Sender] Re: Runtime Parameters to Annotators Running as Services

2018-06-04 Thread Eddie Epstein
From the original description I understand the scenario to be that the
service needs to access a database that is unknown at service
initialization time. Then the CAS received by the service must include a
handle to the database. The CAS would be generated by the client, which
in your case sounds like it includes a collection reader. If the client is a
UIMA aggregate and the remote service is one of the delegates, then any
annotator between the CR and the remote delegate could add content to the
CAS.
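
A minimal sketch of a CR (or any upstream annotator) adding such a handle to the CAS; the DocumentMetadata type and its features are hypothetical and would need to be defined in the pipeline's type system:

// Sketch: drop the runtime database handle into each CAS so remote delegates
// can pick it up. Type and feature names are hypothetical; dbId/docId are placeholders.
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.Feature;
import org.apache.uima.cas.FeatureStructure;
import org.apache.uima.cas.Type;

public class MetadataHelper {
  static void addMetadata(CAS cas, String dbId, String docId) {
    Type mdType = cas.getTypeSystem().getType("org.example.DocumentMetadata"); // hypothetical type
    Feature dbFeat = mdType.getFeatureByBaseName("databaseId");
    Feature docFeat = mdType.getFeatureByBaseName("documentId");
    FeatureStructure md = cas.createFS(mdType);
    md.setStringValue(dbFeat, dbId);
    md.setStringValue(docFeat, docId);
    cas.addFsToIndexes(md);
  }
}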

Sorry if I am missing something here.
Eddie

On Fri, Jun 1, 2018 at 9:39 AM, Osborne, John D  wrote:

> Thanks - when you say having the client putting the data in the CAS do you
> mean:
>
> 1) Putting in the CollectionReader which the client is instantiating
> 2) Some other mechanism of putting data into the CAS I am not aware of
>
> I had been using 1), but in the process of refactoring my
> CollectionReader I was trying to slim it down and just have it pass
> document identifiers to the aggregate analysis engine. I'm fuzzy on whether
> 2) is an option and, if so, how to implement it.
>
>  -John
>
>
> ________
> From: Eddie Epstein [eaepst...@gmail.com]
> Sent: Thursday, May 31, 2018 4:25 PM
> To: user@uima.apache.org
> Subject: [External Sender] Re: Runtime Parameters to Annotators Running as
> Services
>
> I may not understand the scenario.
>
> For meta-data that would modify the behavior of the analysis, for example
> changing what analysis is run for a  CAS, putting it into the CAS itself is
> definitely recommended.
>
> The example above is for the UIMA service to access the artifact itself
> from a remote source (presumably because it is even less efficient for the
> remote client to put the data into the CAS). That is certainly recommended
> for high scale out of analysis services, assuming that the remote source
> can handle the load and not become a worse bottleneck than just having the
> client put the data into the CAS.
>
> Regards,
> Eddie
>
> On Tue, May 29, 2018 at 1:33 PM, Osborne, John D 
> wrote:
>
> > What is the best practice for passing runtime meta-data about the
> analysis
> > to individual annotators when running UIMA-AS or UIMA-DUCC services? An
> > example would be  a database identifier for an analysis of many
> documents.
> > I can't pass this in as parameters to the aggregate analysis engine
> running
> > as a service, because I don't know what that identifier is until runtime
> > (when the application calls the service).
> >
> > I used to put such information in the JCas, having the CollectionReader
> > implementation do all this work. But I am striving to have a more
> > lightweight CollectionReader... The application can obviously write
> > metadata to a database or other shared resource, but then it becomes
> > incumbent on the AnalysisEngine to access that shared resources over the
> > network (slow).
> >
> > Any advice appreciated,
> >
> >  -John
> >
>


Re: Runtime Parameters to Annotators Running as Services

2018-05-31 Thread Eddie Epstein
I may not understand the scenario.

For meta-data that would modify the behavior of the analysis, for example
changing what analysis is run for a  CAS, putting it into the CAS itself is
definitely recommended.

The example above is for the UIMA service to access the artifact itself
from a remote source (presumably because it is even less efficient for the
remote client to put the data into the CAS). That is certainly recommended
for high scale out of analysis services, assuming that the remote source
can handle the load and not become a worse bottleneck than just having the
client put the data into the CAS.

Regards,
Eddie

On Tue, May 29, 2018 at 1:33 PM, Osborne, John D  wrote:

> What is the best practice for passing runtime meta-data about the analysis
> to individual annotators when running UIMA-AS or UIMA-DUCC services? An
> example would be  a database identifier for an analysis of many documents.
> I can't pass this in as parameters to the aggregate analysis engine running
> as a service, because I don't know what that identifier is until runtime
> (when the application calls the service).
>
> I used to put such information in the JCas, having the CollectionReader
> implementation do all this work. But I am striving to have a more
> lightweight CollectionReader... The application can obviously write
> metadata to a database or other shared resource, but then it becomes
> incumbent on the AnalysisEngine to access that shared resources over the
> network (slow).
>
> Any advice appreciated,
>
>  -John
>


Re: Batch Checkpoints with DUCC?

2018-05-16 Thread Eddie Epstein
Hi,

Yes, exactly. DUCC jobs that specify CM, AE, and CC use a custom flow controller
that routes the WorkItem CAS as desired. By default the route is (CM,CC),
but this can be modified by the contents of the WorkItem feature structure
... http://uima.apache.org/d/uima-ducc-2.2.2/duccbook.html#x1-1930009.5.3
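
A minimal sketch of a job CR using that feature structure, assuming the org.apache.uima.ducc.Workitem JCas type from the DUCC user jar as described in the duccbook section above:

// Sketch: the CR marks the WorkItem CAS so the DUCC flow controller also routes
// it to the CC, which can use that as its "flush cached data" signal.
// Assumes the Workitem JCas type and its sendToLast/inputspec features.
import org.apache.uima.ducc.Workitem;
import org.apache.uima.jcas.JCas;

public class WorkitemHelper {
  static void markForFlush(JCas workItemCas, String inputSpec) {
    Workitem wi = new Workitem(workItemCas);
    wi.setInputspec(inputSpec);   // whatever identifies this batch of work
    wi.setSendToLast(true);       // route the WorkItem CAS to the CC as well
    wi.addToIndexes();
  }
}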

Eddie


On Wed, May 16, 2018 at 2:56 AM, Erik Fäßler 
wrote:

> Hey Eddie, thanks again! :-)
>
> So the idea is that the work item is the CAS that the CR sent to the CM,
> right? The work item CAS consists of a list of artifacts which are output
> by the CM, processed by the pipeline and finally cached by the CC.
> Then, I can somehow (have to read this up) have the work item CAS sent to
> the CC as the effective “batch processing complete” signal.
>
> Is that correct?
>
> > On 15. May 2018, at 20:50, Eddie Epstein  wrote:
> >
> > Hi Erik,
> >
> > There is a brief discussion of this in the duccbook in section 9.3 ...
> > https://uima.apache.org/d/uima-ducc-2.2.2/duccbook.html#x1-1880009.3
> >
> > In particular, the 3rd option, "Flushing cached data". This assumes that
> > the batch of work to be flushed is represented by each workitem CAS.
> >
> > Regards,
> > Eddie
> >
> > On Tue, May 15, 2018 at 9:21 AM, Erik Fäßler 
> > wrote:
> >
> >> And another question concerning DUCC :-)
> >>
> >> With my CPEs I use a lot the batchProcessingComplete() and
> >> collectionProcessingComplete() methods. I need them because I do a lot
> of
> >> database interactions where I need to send data in batches due to the
> >> overhead of network communication.
> >> How is that handled in DUCC? The documentation does not talk about it,
> at
> >> least it not find anything.
> >>
> >> Hints are appreciated.
> >>
> >> Thanks!
> >>
> >> Erik
>
>


Re: Batch Checkpoints with DUCC?

2018-05-15 Thread Eddie Epstein
Hi Erik,

There is a brief discussion of this in the duccbook in section 9.3 ...
https://uima.apache.org/d/uima-ducc-2.2.2/duccbook.html#x1-1880009.3

In particular, the 3rd option, "Flushing cached data". This assumes that
the batch of work to be flushed is represented by each workitem CAS.

Regards,
Eddie

On Tue, May 15, 2018 at 9:21 AM, Erik Fäßler 
wrote:

> And another question concerning DUCC :-)
>
> With my CPEs I use a lot the batchProcessingComplete() and
> collectionProcessingComplete() methods. I need them because I do a lot of
> database interactions where I need to send data in batches due to the
> overhead of network communication.
> How is that handled in DUCC? The documentation does not talk about it, at
> least it not find anything.
>
> Hints are appreciated.
>
> Thanks!
>
> Erik


Re: DUCC job Issue

2018-04-20 Thread Eddie Epstein
DUCC is designed for multi-user environments, and in particular tries to
balance resources fairly quickly across users in a fair-share allocation.
The default mechanism used is preemption. To eliminate preemption, specify a
"non-preemptable" scheduling class for jobs, such as "fixed".

Other options that could be of interest include:

ducc.rm.initialization.cap
This limits allocation to jobs until initialization is successful, limiting
the "damage" to other running preemptable jobs if a new job will not even
initialize.

ducc.rm.expand.by.doubling
This limits the rate at which resources are allocated, allowing some knowledge
of job throughput to be gained and avoiding over-allocation of resources.

ducc.rm.prediction
Used along with doubling to avoid unnecessary allocation.
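
To make the first point concrete, the job specification would include a line
like

  scheduling_class = fixed

(either in the job properties file or as a ducc_submit option), so the job's
processes are allocated from a non-preemptable class and are not preempted when
other work arrives. The ducc.rm.* settings above go into site.ducc.properties.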

Regards,
Eddie


On Fri, Apr 20, 2018 at 5:31 AM, priyank sharma 
wrote:

> Hey!
>
> I am facing trouble while running one job at a time in ducc. I want that
> if one job is running the other one should wait for it to complete and then
> start.
> My configuration file is attached below.
> Please help me with what I am missing.
>
> --
> Thanks and Regards
> Priyank Sharma
>
>


Re: DUCC and CAS Consumers

2018-04-16 Thread Eddie Epstein
Hi,

Are you specifying to DUCC all three component descriptors: CM, AE and CC?
I'm guessing not, but rather that your CM is included in the AAE aggregate given
to DUCC as the AE_Descriptor.  Can you confirm?

Eddie

On Fri, Apr 13, 2018 at 8:21 AM, Erik Fäßler 
wrote:

> Hi Eddie, thanks for the reply!
> I did indeed create a CR and a CM where the CR only sends file references
> which are then processed by the CM into new CASes.
> You wrote
>
> >  The initial CAS created by the driver normally does not flow into the
> > AE, but typically does flow to the CC after all child CASes from the CM
> > have been processed to trigger the CC to finalize the collection.
>
> I did observe exactly that. But it also seems that the new CASes, output
> by the CM, do not flow to the CC. They stay in their AAE.
> I ran the same setup twice where
> 1. in one run, the CC is part of the AAE and the DUCC CC is left blank
> 2. in the second run, the CC is provided as a DUCC CC and not contained in
> the AAE
>
> In the first scenario, the CM-created CASes are passed to the CC and
> written to file. In the second, only one single file is written, only
> containing the artifact reference that was meant for the CM.
>
> Does that mean that when using a CM you should not specify a CC and that
> the a CM can only be used in an AAE where all downstream components are
> also included in the same AAE?
>
> Best,
>
> Erik


Re: DUCC and CAS Consumers

2018-04-11 Thread Eddie Epstein
Hi Erik,

DUCC jobs can scale out a user's components in two ways: horizontally, by
running multiple processes (process_deployments_max), and vertically, by
running the pipeline defined by the CM, AE and CC components in multiple
threads (process_pipeline_count).  Since the constructed top AAE is
designed to run in multiple threads, it requires multiple deployments
enabled for all pipeline components.

The CM and CC components are optional as they could be already included in
the specified process_descriptor_AE. The reason for explicitly specifying
CM and CC components is to facilitate high scale out. The Job's collection
reader should create CASes with references to data which will often be
segmented by the CM into a collection of CASes to be processed by the user's
AE. The initial CAS created by the driver normally does not flow into the
AE, but typically does flow to the CC after all child CASes from the CM
have been processed to trigger the CC to finalize the collection.

More information about the job model is described in the duccbook at
https://uima.apache.org/d/uima-ducc-2.2.2/duccbook.html#x1-181000III
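
For reference, a job specification that makes this split explicit looks roughly
like the sketch below (key names as in the duccbook job specification; the
descriptor names and counts are placeholders):

  driver_descriptor_CR    = MyWorkItemReader.xml
  process_descriptor_CM   = MyWorkItemCasMultiplier.xml
  process_descriptor_AE   = MyAnalysisAggregate.xml
  process_descriptor_CC   = MyCasConsumer.xml
  process_pipeline_count  = 4
  process_deployments_max = 10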

Regards,
Eddie


On Wed, Apr 11, 2018 at 5:16 AM, Erik Fäßler 
wrote:

> Hi all,
>
> I am doing my first steps with UIMA DUCC. I stumbled across the issue that
> my CAS consumer has allowMultipleDeployments=false since it is supposed to
> write multiple CAS document texts into one large ZIP file.
> DUCC complains about the discrepancy of the processing AAE being allowed
> for multiple deployment but one of its containers (my consumer) is not.
> I did specify the consumer with the "process_descriptor_CC” job file key
> and was assuming that DUCC would take care of it. After all, it is a key of
> its own. But it seems the consumer is just wrapped into a new AAE together
> with my annotator AAE. This new top AAE created by DUCC causes the error:
> My own AAE is allowed for multiple deployment and so are its delegates. But
> the consumer not, of course.
>
> How to handle this case? The documentation of DUCC is rather vague at this
> point. There is the section about CAS consumer changes but it doesn’t
> mention multiple deployment explicitly.
>
> What is the “process_descriptor_CC” for when it get wrapped up into an AAE
> with the user-delivered AAE anyway?
>
> Thanks and best regards,
>
> Erik
>
>


Re: Exception: UIMA - Annotator Processing Failed

2018-02-28 Thread Eddie Epstein
Hi,

An annotation feature structure can only be added to the index of the view
it was created in.

It looks like the application at
edu.cmu.lti.oaqa.baseqa.evidence.concept.PassageConceptRecognizer.process(
PassageConceptRecognizer.java:96)*
is trying to add an annotation created in one view to the index of a
different view.

Regards,
Eddie


On Wed, Feb 28, 2018 at 1:33 AM, Fatima Zulifqar <
fatimazulifqar...@gmail.com> wrote:

> Dear,
>
> I am facing the following issue while running an open source project which
> is based upon uima framework. I didn't find any solution concerned yet.
>
> *Feb 27, 2018 11:57:39 AM
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl
> callAnalysisComponentProcess(417)*
> *SEVERE: Exception occurred*
> *org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator
> processing failed.*
> * at
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.
> callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:401)*
> * at
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.
> processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:308)*
> * at
> org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(
> AnalysisEngineImplBase.java:269)*
> * at
> org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(
> AnalysisEngineImplBase.java:284)*
> * at edu.cmu.lti.oaqa.ecd.phase.BasePhase$1.run(BasePhase.java:226)*
> * at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)*
> * at java.util.concurrent.FutureTask.run(FutureTask.java:266)*
> * at
> java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1149)*
> * at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:624)*
> * at java.lang.Thread.run(Thread.java:748)*
> *Caused by: org.apache.uima.cas.CASRuntimeException: Error - the
> Annotation
> "#1 ConceptMention
> "ptv.http://www.ncbi.nlm.nih.gov/pubmed/21631649/abstract/
> 756/abstract/1073
> "*
> *   sofa:
> ptv.http://www.ncbi.nlm.nih.gov/pubmed/21631649/abstract/756/abstract/1073
> *
> *   begin: 223*
> *   end: 232*
> *   concept: #0 Concept*
> *  names: NonEmptyStringList*
> * head: "anti gq1b"*
> * tail: EmptyStringList*
> *  uris: EmptyStringList*
> *  ids: EmptyStringList*
> *  mentions: NonEmptyFSList*
> * head: ConceptMention*
> *sofa:
> ptv.http://www.ncbi.nlm.nih.gov/pubmed/22698187/abstract/
> 1131/abstract/1299
> *
> *begin: 32*
> *end: 41*
> *concept: #0*
> *matchedName: "anti-GQ1b"*
> *score: NaN*
> * tail: NonEmptyFSList*
> *head: #1*
> *tail: NonEmptyFSList*
> *   head: ConceptMention*
> *  sofa:
> ptv.http://www.ncbi.nlm.nih.gov/pubmed/23927937/abstract/303/abstract/503
> *
> *  begin: 179*
> *  end: 188*
> *  concept: #0*
> *  matchedName: "anti-GQ1b"*
> *  score: NaN*
> *   tail: NonEmptyFSList*
> *  head: ConceptMention*
> * sofa:
> ptv.http://www.ncbi.nlm.nih.gov/pubmed/19664367/abstract/0/abstract/330
> *
> * begin: 40*
> * end: 49*
> * concept: #0*
> * matchedName: "anti-GQ1b"*
> * score: NaN*
> *  tail: NonEmptyFSList*
> * head: ConceptMention*
> *sofa:
> ptv.http://www.ncbi.nlm.nih.gov/pubmed/25379047/abstract/140/abstract/406
> *
> *begin: 133*
> *end: 142*
> *concept: #0*
> *matchedName: "anti-GQ1b"*
> *score: NaN*
> * tail: NonEmptyFSList*
> *head: ConceptMention*
> *   sofa:
> ptv.http://www.ncbi.nlm.nih.gov/pubmed/22698187/abstract/189/abstract/386
> *
> *   begin: 3*
> *   end: 12*
> *   concept: #0*
> *   matchedName: "anti-GQ1b"*
> *   score: NaN*
> *tail: NonEmptyFSList*
> *   head: ConceptMention*
> *  sofa:
> ptv.http://www.ncbi.nlm.nih.gov/pubmed/22698187/abstrac

Re: Completion event for replicated components

2018-01-18 Thread Eddie Epstein
There will be a new mechanism to help do this in the upcoming
uima-as-2.10.2 version. This version includes an additional listener on
every service that can be addressed individually. A uima-as client could
then iterate through all service instances calling CPC, assuming the client
knew about all existing instances.

This does not solve the problem for replicated components in the same
service instance. For that the thread receiving the CPC would have to use
generic methods to trigger CPC processing in all the other threads.
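
A minimal sketch of such a generic mechanism (this is plain Java, not a UIMA-AS
API; the MyConsumer class name and flushOutput() method are illustrative). Each
replicated instance registers itself in a static list, and whichever instance
receives the CPC call flushes them all:

  // Inside the replicated consumer/annotator class (one JVM, many instances):
  private static final List<MyConsumer> INSTANCES = new CopyOnWriteArrayList<>();

  @Override
  public void initialize(UimaContext ctx) throws ResourceInitializationException {
    super.initialize(ctx);
    INSTANCES.add(this);
  }

  @Override
  public void collectionProcessComplete() throws AnalysisEngineProcessException {
    for (MyConsumer c : INSTANCES) {
      c.flushOutput();  // user method that finalizes this instance's output
    }
  }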

Eddie

On Thu, Jan 18, 2018 at 7:54 AM, n7...@t-online.de 
wrote:

> Hi,
>
> in chapter 1.5.2 in UIMA AS documentation
> https://uima.apache.org/d/uima-as-2.9.0/uima_async_
> scaleout.html#ugr.async.ov.concepts.collection_process_complete
> its stated that only one instance will receive the
> collectionProcessComplete call, if components are replicated.
>
> What is the best way to get collectionProcessComplete() or something else
> called in the replicated consumer components, when the collection is
> finished, in order to complete writing any output?
>
> Thanks and best regards,
> John
>


Re: Ducc Service Registration Error

2017-11-20 Thread Eddie Epstein
Hi,

Annotator class "org.orkash.annotator.AnalysisEngine.TreebankChunkerAnnotator"
was not found ... means that this class is not in the classpath specified
by the registration.

Eddie

On Mon, Nov 20, 2017 at 9:17 AM, priyank sharma 
wrote:

> Hi!
>
> When i am registering the service on the ducc it is not able to start and
> giving the error
>
> WARNING:
> org.apache.uima.resource.ResourceInitializationException: Annotator class
> "org.orkash.annotator.AnalysisEngine.TreebankChunkerAnnotator" was not
> found. (Descriptor: file:/home/ducc/Uima_Sentiment
> _NLP/desc/orkash/ae/TreeBankChunkerDescriptor.xml)
> at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine
> _impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:220)
> at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine
> _impl.initialize(PrimitiveAnalysisEngine_impl.java:170)
> at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResou
> rce(AnalysisEngineFactory_impl.java:94)
> at org.apache.uima.impl.CompositeResourceFactory_impl.produceRe
> source(CompositeResourceFactory_impl.java:62)
> at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.
> java:279)
> at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFram
> ework.java:407)
> at org.apache.uima.analysis_engine.asb.impl.ASB_impl.setup(ASB_
> impl.java:256)
> at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine
> _impl.initASB(AggregateAnalysisEngine_impl.java:429)
> at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine
> _impl.initializeAggregateAnalysisEngine(AggregateAnalysisEngine_impl.
> java:373)
> at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine
> _impl.initialize(AggregateAnalysisEngine_impl.java:186)
> at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResou
> rce(AnalysisEngineFactory_impl.java:94)
> at org.apache.uima.impl.CompositeResourceFactory_impl.produceRe
> source(CompositeResourceFactory_impl.java:62)
> at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.
> java:279)
> at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFram
> ework.java:407)
> at org.apache.uima.aae.controller.PrimitiveAnalysisEngineContro
> ller_impl.initializeAnalysisEngine(PrimitiveAnalysisEngineCo
> ntroller_impl.java:265)
> at org.apache.uima.aae.UimaAsThreadFactory$1.run(UimaAsThreadFa
> ctory.java:120)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException:
> org.orkash.annotator.AnalysisEngine.TreebankChunkerAnnotator
> at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:260)
> at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine
> _impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:217)
> ... 16 more
>
> Nov 20, 2017 7:15:22 PM 
> org.apache.uima.adapter.jms.activemq.SpringContainerDeployer
> notifyOnInitializationFailure
> WARNING: Top Level Controller Initialization Exception.
> org.apache.uima.resource.ResourceInitializationException: Annotator class
> "org.orkash.annotator.AnalysisEngine.TreebankChunkerAnnotator" was not
> found. (Descriptor: file:/home/ducc/Uima_Sentiment
> _NLP/desc/orkash/ae/TreeBankChunkerDescriptor.xml)
> at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine
> _impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:220)
> at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine
> _impl.initialize(PrimitiveAnalysisEngine_impl.java:170)
> at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResou
> rce(AnalysisEngineFactory_impl.java:94)
> at org.apache.uima.impl.CompositeResourceFactory_impl.produceRe
> source(CompositeResourceFactory_impl.java:62)
> at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.
> java:279)
> at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFram
> ework.java:407)
> at org.apache.uima.analysis_engine.asb.impl.ASB_impl.setup(ASB_
> impl.java:256)
> at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine
> _impl.initASB(AggregateAnalysisEngine_impl.java:429)
> at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine
> _impl.initializeAggregateAnalysisEngine(AggregateAnalysisEngine_impl.
> java:373)
> at org.apache.uima.analysis_engine.impl.AggregateAnalysis

Re: DUCC's job goes into infinite loop

2017-11-13 Thread Eddie Epstein
Several different issues here. There is no "job completion cap"; rather,
there is a limit on how long an individual work item will be allowed to
process before it is labeled a timeout. The default number of such errors +
exceptions before a Job is stopped is 15. Please increase this cap if you
expect a work item to go longer.
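
In the job specification these correspond roughly to the following keys (names
from memory of the duccbook; please verify them for your DUCC version):

  # minutes allowed for one work item before it is counted as a timeout
  process_per_item_time_max = 120
  # number of work item errors/timeouts tolerated before the job is stopped
  process_failures_limit = 50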

If a job process runs out of heap space it should go OOM at which point
unpredictable things will happen.  Do you see OOM exceptions in the JP
logfiles?

As for a bug, it is still hard to understand what is happening. Newer
versions of DUCC include a ducc_gather_logs command that collects DUCC
daemon logfiles and state and makes it more likely we can understand what
is happening. No user application logfiles are included in the captured tar
file.

Regards,
Eddie

On Mon, Nov 13, 2017 at 12:33 AM, priyank sharma 
wrote:

> Yes, i am using DUCC v2.0.1 i have a three node cluster with 32gb ram,
> 40gb ram and 28gb ram. Job runs fine for 15-20 days after that it goes into
> the infinite loop with the same batch of the id's. We have a 75 minutes cap
> for a job to complete if not then its start again so after every 75 minutes
> new job starts but with the same id batch as previous and not even a single
> document ingested in the data store it goes in the same state untill we
> restarts the server.
>
> Is this because of the DUCC v2.0.1, are this version of DUCC having that
> bug?
>
> Is this problem occur because of the Java Heap Space?
>
> Please suggest something as there are nothing in the logs regarding to my
> problem.
>
> Thanks and Regards
> Priyank Sharma
>
> On Friday 10 November 2017 09:00 PM, Eddie Epstein wrote:
>
>> Hi Priyank,
>>
>> Looks like you are running DUCC v2.0.x. There are so many bugs fixed in
>> subsequent versions, the latest being v2.2.1. Newer versions have a
>> ducc_update command that will upgrade an existing install, but given all
>> the changes since v2.0.x I suggest a clean install.
>>
>> Eddie
>>
>> On Fri, Nov 10, 2017 at 12:11 AM, priyank sharma <
>> priyank.sha...@orkash.com>
>> wrote:
>>
>> There is nothing on the work item page and performance page on the web
>>> server. There is only one log file for the main node, no log files for
>>> other two nodes. Ducc job processes not able to pick the data from the
>>> data
>>> source and no UIMA aggregator is working for that batches.
>>>
>>> Are the issue because of the java heap space? We are giving 4gb ram to
>>> the
>>> job-process.
>>>
>>> Attaching the Log file.
>>>
>>> Thanks and Regards
>>> Priyank Sharma
>>>
>>> On Thursday 09 November 2017 04:33 PM, Lou DeGenaro wrote:
>>>
>>> The first place to look is in your job's logs.  Visit the ducc-mon jobs
>>>> page ducchost:42133/jobs.jsp then click on the id of your job.  Examine
>>>> the
>>>> logs by clicking on each log file name looking for any revealing
>>>> information.
>>>>
>>>> Feel free to post non-confidential snippets here, or If you'd like to
>>>> chat
>>>> in real time we can use hipchat.
>>>>
>>>> Lou.
>>>>
>>>> On Thu, Nov 9, 2017 at 5:19 AM, priyank sharma <
>>>> priyank.sha...@orkash.com
>>>> wrote:
>>>>
>>>> All!
>>>>
>>>>> I have a problem regarding DUCC cluster in which a job process gets
>>>>> stuck
>>>>> and keeps on processing the same batch again and again due to maximum
>>>>> duration the batch gets reason or extraordinary status
>>>>> *"**CanceledByUser"
>>>>> *and then gets restarted with the same ID's. This usually happens after
>>>>> 15
>>>>> to 20 days and goes away after restarting the ducc cluster. While going
>>>>> through the data store that is being used by CAS consumer to ingest
>>>>> data,
>>>>> the data regarding this batch does never get ingested. So most probably
>>>>> this data is not being processed.
>>>>>
>>>>> How to check if this data is being processed or not?
>>>>>
>>>>> Are the resources the issue and why it is being processed after
>>>>> restarting
>>>>> the cluster?
>>>>>
>>>>> We have three nodes cluster with  32gb ram, 40gb ram and 28 gb ram.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks and Regards
>>>>> Priyank Sharma
>>>>>
>>>>>
>>>>>
>>>>>
>


Re: DUCC's job goes into infinite loop

2017-11-10 Thread Eddie Epstein
Hi Priyank,

Looks like you are running DUCC v2.0.x. There are so many bugs fixed in
subsequent versions, the latest being v2.2.1. Newer versions have a
ducc_update command that will upgrade an existing install, but given all
the changes since v2.0.x I suggest a clean install.

Eddie

On Fri, Nov 10, 2017 at 12:11 AM, priyank sharma 
wrote:

> There is nothing on the work item page and performance page on the web
> server. There is only one log file for the main node, no log files for
> other two nodes. Ducc job processes not able to pick the data from the data
> source and no UIMA aggregator is working for that batches.
>
> Are the issue because of the java heap space? We are giving 4gb ram to the
> job-process.
>
> Attaching the Log file.
>
> Thanks and Regards
> Priyank Sharma
>
> On Thursday 09 November 2017 04:33 PM, Lou DeGenaro wrote:
>
>> The first place to look is in your job's logs.  Visit the ducc-mon jobs
>> page ducchost:42133/jobs.jsp then click on the id of your job.  Examine
>> the
>> logs by clicking on each log file name looking for any revealing
>> information.
>>
>> Feel free to post non-confidential snippets here, or If you'd like to chat
>> in real time we can use hipchat.
>>
>> Lou.
>>
>> On Thu, Nov 9, 2017 at 5:19 AM, priyank sharma > >
>> wrote:
>>
>> All!
>>>
>>> I have a problem regarding DUCC cluster in which a job process gets stuck
>>> and keeps on processing the same batch again and again due to maximum
>>> duration the batch gets reason or extraordinary status
>>> *"**CanceledByUser"
>>> *and then gets restarted with the same ID's. This usually happens after
>>> 15
>>> to 20 days and goes away after restarting the ducc cluster. While going
>>> through the data store that is being used by CAS consumer to ingest data,
>>> the data regarding this batch does never get ingested. So most probably
>>> this data is not being processed.
>>>
>>> How to check if this data is being processed or not?
>>>
>>> Are the resources the issue and why it is being processed after
>>> restarting
>>> the cluster?
>>>
>>> We have three nodes cluster with  32gb ram, 40gb ram and 28 gb ram.
>>>
>>>
>>>
>>> --
>>> Thanks and Regards
>>> Priyank Sharma
>>>
>>>
>>>
>


Re: UIMA analysis from a database

2017-09-15 Thread Eddie Epstein
DUCC does have hooks to allow entire machines to be dynamically added or
removed from a running DUCC cluster. So in principle DUCC could be run as
an application under a different resource manager as long as resources were
available at the machine level. It should also be possible to run other
infrastructures under DUCC, for example where a Hadoop/Spark subcluster is
turned on and off as required.

One aspect of DUCC that is not cloud friendly has been its dependency on a
shared filesystem. There has been work done recently to remove this requirement,
and the latest release can run without a shared FS, but some useful
functionality is not available. Specifically, facilitating the distribution
of user application code to worker machines, and automatically retrieving
user logfiles written to local disk to the DUCC web console.

regards,
Eddie

On Fri, Sep 15, 2017 at 2:54 PM, Fox, David  wrote:

> Another thanks to all contributing to this thread.
>
> We're looking to transition a NLP large application processing ~30TB/month
> from a custom NLP framework to UIMA-AS, and from parallel processing on a
> dedicated cluster with custom python scripts which call gnu parallel, to
> something with better support for managing resources on a shared cluster.
>
> Both our internal IT/engineering group and our cluster vendor
> (HortonWorks) use and support Hadoop/Spark/YARN on a new shared cluster.
> DUCC¹s capabilities seem to overlap with these more general purpose tools.
>  Although it may be more closely aligned with UIMA for a dedicated
> cluster, I think the big question for us would be how/whether it would
> play nicely with other Hadoop/Spark/YARN jobs on the shared cluster.
> We're also likely to move at least some of our workload to a cloud
> computing host, and it seems like Hadoop/Spark are much more likely to be
> supported there.
>
> David Fox
>
> On 9/15/17, 1:57 PM, "Eddie Epstein"  wrote:
>
> >There are a few DUCC features that might be of particular interest for
> >scaling out UIMA analytics.
> >
> > - all user code for batch processing continues to use the existing UIMA
> >component model: collection readers, cas multipliers, analysis engines, and
> >cas consumers.**
> >
> > - DUCC supports assembling and debugging a single threaded process with
> >these components, and then with no code change launch a highly scaled out
> >deployment.
> >
> > - for applications that use too much RAM to be able to utilize all the
> >cores on worker machines, DUCC can do the vertical (thread) scaleout
> >needed
> >to share memory.
> >
> > - DUCC automatically captures the performance breakdown of the UIMA-based
> >processes, as well as capturing process statistics including CPU, RAM,
> >swap, pagefaults and GC. Performance breakdown info for individual tasks
> >(DUCC work items) can optionally be captured.
> >
> > - DUCC has extensive error handling, automatically resubmitting work
> >associated with uncaught exceptions, process crashes, machine failures,
> >network failures, etc.
> >
> > - Exceptions are convenient to get to, and an attempt is made to make
> >obvious things that might be tricky to find, such as all the reasons a
> >process
> >might fail to start, without having to dig through DUCC framework logs.
> >
> >** DUCC services introduce a new user programmable component, a service
> >pinger, that is responsible for validating that a service is operating
> >correctly. The service pinger can also dynamically change the number of
> >instances of a service, and it can restart individual instances that are
> >determined to be acting badly.
> >
> >Eddie
> >
> >On Fri, Sep 15, 2017 at 10:32 AM, Osborne, John D 
> >wrote:
> >
> >> Thanks Richard and Nicholas,
> >>
> >> Nicholas - have you looked at SUIM (https://github.com/oaqa/suim) ?
> >>
> >> It's also doing UIMA on Spark - I'm wondering if you are aware of it and
> >> how it is different from your own project?
> >>
> >> Thanks for any info,
> >>
> >>  -John
> >>
> >>
> >> 
> >> From: Richard Eckart de Castilho [r...@apache.org]
> >> Sent: Friday, September 15, 2017 5:29 AM
> >> To: user@uima.apache.org
> >> Subject: Re: UIMA analysis from a database
> >>
> >> On 15.09.2017, at 09:28, Nicolas Paris  wrote:
> >> >
> >> > - UIMA-AS is another way to program UIMA
> >>
> >> Here you probably meant uimaFIT.
> >>
> >> > - UIMA-FIT is complicated
> >> > - UIMA-FIT only work wit

Re: UIMA analysis from a database

2017-09-15 Thread Eddie Epstein
There are a few DUCC features that might be of particular interest for
scaling out UIMA analytics.

 - all user code for batch processing continues to use the existing UIMA
component model: collection readers, cas multipliers, analysis engines, and
cas consumers.**

 - DUCC supports assembling and debugging a single threaded process with
these components, and then with no code change launch a highly scaled out
deployment.

 - for applications that use too much RAM to be able to utilize all the
cores on worker machines, DUCC can do the vertical (thread) scaleout needed
to share memory.

 - DUCC automatically captures the performance breakdown of the UIMA-based
processes, as well as capturing process statistics including CPU, RAM,
swap, pagefaults and GC. Performance breakdown info for individual tasks
(DUCC work items) can optionally be captured.

 - DUCC has extensive error handling, automatically resubmitting work
associated with uncaught exceptions, process crashes, machine failures,
network failures, etc.

 - Exceptions are convenient to get to, and an attempt is made to make
obvious things that might be tricky to find, such as all the reasons a process
might fail to start, without having to dig through DUCC framework logs.

** DUCC services introduce a new user programmable component, a service
pinger, that is responsible for validating that a service is operating
correctly. The service pinger can also dynamically change the number of
instances of a service, and it can restart individual instances that are
determined to be acting badly.

Eddie

On Fri, Sep 15, 2017 at 10:32 AM, Osborne, John D 
wrote:

> Thanks Richard and Nicholas,
>
> Nicholas - have you looked at SUIM (https://github.com/oaqa/suim) ?
>
> It's also doing UIMA on Spark - I'm wondering if you are aware of it and
> how it is different from your own project?
>
> Thanks for any info,
>
>  -John
>
>
> 
> From: Richard Eckart de Castilho [r...@apache.org]
> Sent: Friday, September 15, 2017 5:29 AM
> To: user@uima.apache.org
> Subject: Re: UIMA analysis from a database
>
> On 15.09.2017, at 09:28, Nicolas Paris  wrote:
> >
> > - UIMA-AS is another way to program UIMA
>
> Here you probably meant uimaFIT.
>
> > - UIMA-FIT is complicated
> > - UIMA-FIT only work with UIMA
>
> ... and I suppose you mean UIMA-AS here.
>
> > - UIMA only focuses on text Annotation
>
> Yep. Although it has also been used for other media, e.g. video and audio.
> But the core UIMA framework doesn't specifically consider these media.
> People who apply it UIMA in the context of other media do so with custom
> type systems.
>
> > - UIMA is not good at:
> >   - text transformation
>
> It is not straight-forward but possible. E.g. the text normalizers in
> DKPro Core make use of either different views for different states of
> normalization or drop the original text and forward the normalized
> text within a pipeline by means of a CAS multiplier.
>
> >   - read data from source in parallel
> >   - write data to folder in parallel
>
> Not sure if these two are limitations of the framework
> rather than of the way that you use readers and writers
> in the particular scale-out mode you are working with.
>
> >   - machine learning interface
>
> UIMA doesn't offer ML as part of the core framework because
> that is simply not within the scope of what the UIMA framework
> aims to achieve.
>
> There are various people who have built ML around UIMA, e.g.
> ClearTK (http://cleartk.github.io/cleartk/) or DKPro TC
> (https://dkpro.github.io/dkpro-tc/) - and as you did, it
> can be combined in various ways with ML frameworks that
> specialize specifically on ML.
>
>
> Cheers,
>
> -- Richard
>
>
>


Re: DUCC job automatically fails and gives Reason, or extraordinary status as cancelled by User | DUCC Version: 2.0.1

2017-05-17 Thread Eddie Epstein
How long does the job run before stopping? Cancelled by user could come if
the job is submitted with cancel_on_interrupt and the client that submitted the
job was stopped.
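
That is, a submit along the lines of

  ducc_submit --specification myjob.properties --cancel_on_interrupt

ties the job's lifetime to the submitting client, whereas leaving out
--cancel_on_interrupt lets the job keep running after the client goes away
(option names from memory; check ducc_submit --help for the exact form).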

Eddie

On Tue, May 16, 2017 at 8:31 AM, Lou DeGenaro 
wrote:

> Dunno why the connection would be refused.  Are the JD and JP on the same
> or different machines?  Is the network viable between the machines on which
> each is located?
>
> Lou.
>
> On Tue, May 16, 2017 at 8:18 AM, priyank sharma  >
> wrote:
>
> > Hey!
> >
> > There were no error found in JD log.Following is a snippet of the jD log
> >
> > 14 May 2017 18:47:39,593  INFO ActionGet - T[482] engage  seqNo=3484
> > remote=S144.3170.35
> > 14 May 2017 18:47:39,641  INFO ActionGet - T[283] engage  seqNo=3485
> > remote=S144.2443.34
> > 14 May 2017 18:47:40,688  INFO ActionEnd - T[284] engage  seqNo=3470
> > remote=S144.2443.36 ended
> > in getNext
> > 14 May 2017 18:47:40,736  INFO ActionGet - T[483] engage  seqNo=3486
> > remote=S144.2443.36
> > 14 May 2017 18:47:43,207  INFO ActionEnd - T[482] engage  seqNo=3477
> > remote=S144.3346.32 ended
> > in getNext
> > 14 May 2017 18:47:43,254  INFO ActionGet - T[284] engage  seqNo=3487
> > remote=S144.3346.32
> > 14 May 2017 18:47:43,258  INFO ActionEnd - T[283] engage  seqNo=3467
> > remote=S144.2443.35 ended
> > in getNext
> > 14 May 2017 18:47:43,296  INFO ActionGet - T[483] engage  seqNo=3488
> > remote=S144.2443.35
> > 14 May 2017 18:47:44,425  INFO ActionEnd - T[283] engage  seqNo=3468
> > remote=S144.3346.34 ended
> > in getNext
> > 14 May 2017 18:47:44,605  INFO ActionGet - T[483] engage  seqNo=3489
> > remote=S144.3346.34
> > 14 May 2017 18:47:46,105  INFO ActionEnd - T[283] engage  seqNo=3480
> > remote=S144.3346.33 ended
> > in getNext
> > 14 May 2017 18:47:46,166  INFO ActionGet - T[482] engage  seqNo=3490
> > remote=S144.3346.33
> > 14 May 2017 18:47:46,233  INFO ActionEnd - T[284] engage  seqNo=3478
> > remote=S144.3346.36 ended
> > in getNext
> > 14 May 2017 18:47:46,415  INFO ActionGet - T[482] engage  seqNo=3491
> > remote=S144.3346.36
> > 14 May 2017 18:47:49,924  INFO ActionEnd - T[284] engage  seqNo=3475
> > remote=S144.3348.35 ended
> > in getNext
> > 14 May 2017 18:47:49,968  INFO ActionGet - T[482] engage  seqNo=3492
> > remote=S144.3348.35
> > 14 May 2017 18:47:50,856  INFO ActionEnd - T[283] engage  seqNo=3469
> > remote=S144.3348.32 ended
> > in getNext
> > 14 May 2017 18:47:50,918  INFO ActionGet - T[284] engage  seqNo=3493
> > remote=S144.3348.32
> > 14 May 2017 18:47:53,566  INFO ActionEnd - T[284] engage  seqNo=3459
> > remote=S144.2443.33 ended
> > in getNext
> > 14 May 2017 18:47:53,599  INFO ActionGet - T[483] engage  seqNo=3494
> > remote=S144.2443.33
> > 14 May 2017 18:47:58,507  INFO ActionEnd - T[283] engage  seqNo=3473
> > remote=S144.3348.36 ended
> > in getNext
> > 14 May 2017 18:47:58,565  INFO ActionGet - T[284] engage  seqNo=3495
> > remote=S144.3348.36
> > 14 May 2017 18:48:06,218  INFO ActionEnd - T[283] engage  seqNo=3460
> > remote=S144.3348.34 ended
> > in getNext
> > 14 May 2017 18:48:06,360  INFO ActionGet - T[483] engage  seqNo=3496
> > remote=S144.3348.34
> > 14 May 2017 18:48:09,619  INFO ActionEnd - T[283] engage  seqNo=3481
> > remote=S144.2443.32 ended
> > in getNext
> > 14 May 2017 18:48:09,674  INFO ActionEnd - T[483] engage  seqNo=3479
> > remote=S144.3170.36 ended
> > 14 May 2017 18:48:09,681  INFO ActionGet - T[284] engage  seqNo=3497
> > remote=S144.2443.32
> > in getNext
> > 14 May 2017 18:48:09,814  INFO ActionGet - T[482] engage  seqNo=3498
> > remote=S144.3170.36
> > 14 May 2017 18:48:13,464  INFO ActionEnd - T[283] engage  seqNo=3476
> > remote=S144.3346.35 ended
> > in getNext
> > 14 May 2017 18:48:13,498  INFO ActionGet - T[483] engage  seqNo=3499
> > remote=S144.3346.35
> > 14 May 2017 18:48:15,116  INFO ActionEnd - T[284] engage  seqNo=3482
> > remote=S144.3170.32 ended
> > in getNext
> > 14 May 2017 18:48:15,163  INFO ActionGet - T[283] engage  seqNo=3500
> > remote=S144.3170.32
> > 14 May 2017 18:48:17,050  INFO ActionEnd - T[284] engage  seqNo=3465
> > remote=S144.3170.33 ended
> > in getNext
> > 14 May 2017 18:48:17,141  INFO ActionGet - T[482] engage  seqNo=3501
> > remote=S144.3170.33
> > 14 May 2017 18:48:19,138  INFO ActionEnd - T[284] engage  seqNo=3471
> > remote=S144.3170.34 ended
> > 14 May 2017 18:48:19,148  INFO ActionEnd - T[283] engage  seqNo=3487
> > remote=S144.3346.32 ended
> > in getNext
> > in getNext
> > 14 May 2017 18:48:19,180  INFO ActionGet - T[483] engage  seqNo=3502
> > remote=S144.3170.34
> > 14 May 2017 18:48:19,262  INFO ActionGet - T[284] engage  seqNo=3503
> > remote=S144.3346.32
> > 14 May 2017 18:48:22,923  INFO ActionEnd - T[482] engage  seqNo=3486
> > remote=S144.2443.36 ended
> > in getNext
> > 14 May 2017 18:48:22,977  INFO ActionGet - T[284] engage  seqNo=3504
> > remote=S144.2443.36
> > 14 May 2017 18:48:32,013  INFO ActionEnd - T[284] engage  seqNo=3492
> > remote=S144.3348.35 ended
> > in getNext
> > 14 May 2017 18:48:32,055

Re: Synchronizing Batches AE and StatusCallbackListener

2017-04-21 Thread Eddie Epstein
Hi Erik,

A few words about DUCC and your application. DUCC is a cluster controller
that includes a resource manager and three applications: batch processing,
long-running services, and singleton processes.

The batch processing application consists of a user's CollectionReader, which
defines work items, and a user's aggregate for processing work items, which can
be replicated as desired across the cluster of machines. DUCC manages the
remote process scale out and distribution of work items. The aggregate can
be vertically scaled within each process so that in-heap data can be shared
by multiple instances of the aggregate. UIMA-AS is not required for this
simple threading model.

For most applications a work item is itself a collection, a CAS containing
references to the data to be processed, where the collection size is
designed to have small enough granularity to support scale out but big
enough granularity to avoid bottlenecks.

The user's aggregate normally has an initial CasMultiplier that reads the
input data and creates the CASes to be fed to the rest of the pipeline.
When all child CASes have finished processing, the work item CAS is routed
to the aggregate's CasConsumer to finalize the collection. DUCC considers
the work item complete only when the work item CAS is successfully
processed.
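
A skeletal work-item CasMultiplier, just to show the shape (class and helper
names are illustrative; error handling and the real data access are omitted):

  public class WorkItemCasMultiplier extends JCasMultiplier_ImplBase {
    private Iterator<String> docTexts;  // documents referenced by the work item

    @Override
    public void process(JCas workItemCas) throws AnalysisEngineProcessException {
      // Resolve the references carried in the work item CAS and load the raw data.
      docTexts = loadDocumentsFor(workItemCas);  // user-supplied helper
    }

    @Override
    public boolean hasNext() {
      return docTexts.hasNext();
    }

    @Override
    public AbstractCas next() throws AnalysisEngineProcessException {
      JCas childCas = getEmptyJCas();             // provided by the base class
      childCas.setDocumentText(docTexts.next());  // one child CAS per document
      return childCas;
    }

    private Iterator<String> loadDocumentsFor(JCas workItemCas) {
      return Collections.emptyIterator();         // real implementation not shown
    }
  }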

The system is quite robust to errors: uncaught exceptions, analytics
crashing, machines crashing, etc.

Regards,
Eddie


On Fri, Apr 21, 2017 at 2:12 PM, Olga Patterson 
wrote:

> Erik,
>
> My team at the VA have developed an easy way of implementing UIMA AS
> pipelines and scaling them to a large number of nodes - using Leo framework
> that extends UIMA AS 2.8.1. We have run pipelines on over 200M documents
> scaled across multiple nodes with dozens of service instances and it
> performs great.
>
> Here is some info:
> http://department-of-veterans-affairs.github.io/Leo/
>
> The documentation for Leo reflects an earlier version of Leo, but if you
> are interested in using it with Java 8 and UIMA 2.8.1, we have not released
> the latest version in on the VA github yet but we can share it with you so
> that you can test it out and possibly provide your comments back to us.
>
> Leo has simple-to-use functionality for flexible batch read and write and
> it can work with any UIMA AEs and existing descriptor files and type system
> descriptions, so if you already have a pipeline, wrapping it with Leo would
> take just a few lines of code.
>
> Let me know if you are interested and I can help you to get started.
>
> Olga Patterson
>
>
>
>
>
>
>
> -Original Message-
> From: Jaroslaw Cwiklik 
> Reply-To: "user@uima.apache.org" 
> Date: Friday, April 21, 2017 at 8:08 AM
> To: "user@uima.apache.org" 
> Subject: Re: Synchronizing Batches AE and StatusCallbackListener
>
> Erik, thanks. This is more clear what you are trying to accomplish.
> First,
> there are no plans to retire the CPE. It is supported and I don't know
> of
> any plans to retire it. The only issue is ongoing development. My
> efforts
> are focused on extending and improving UIMA-AS.
>
> I don't have an answer yet how to handle the CPE crash scenario with
> respect to batching and subsequent restart from the last known good
> batch.
> Seems like some coordination would be needed to avoid redoing the whole
> collection after a crash. Its been awhile since I've looked at the CPE.
> Will take a look and see what is possible if anything.
>
> There is another Apache UIMA project called DUCC which stands for
> Distributed Uima Cluster Computing. From your email it looks like you
> have
> a cluster of machines available. Here is a quick description of DUCC:
>
> DUCC is a Linux cluster controller designed to scale out any UIMA
> pipeline
> for high throughput collection processing jobs as well as for low
> latency
> real-tme applications. Building on UIMA-AS, DUCC is particularly well
> suited to run large memory Java analytics in multiple threads in order
> to
> fully utilize multicore machines. DUCC manages the life cycle of all
> processes deployed across the cluster, including non-UIMA processes
> such as
> tomcat servers or VNC sessions.
>
>  You can find more info on this here:
> https://uima.apache.org/doc-uimaducc-whatitam.html
>
> In UIMA-AS batching is an application concern. I am a bit fuzzy on
> implementation so perhaps someone else can comment how to implement
> batching and how to handle errors. You can use a CasMultipler and a
> custom
> FlowController to manage CASes and react to errors.The UIMA-AS service
> can
> take an input CAS representing your batch, pass it on to the
> CasMultiplier,
> generate CASes for each piece of work and deliver results to the
> CasConsumer with a FlowController in the middle orchestrating the
> flow. I
> defer to application deployment experts to provide you with more
> detail.
>
> Jerry
>
>
>
>
>
>
>
>   

Re: Free instance of aggregate with cas multiplier in MultiprocessingAnalysisEngine

2016-11-09 Thread Eddie Epstein
Sounds like a bug in MultiprocessingAnalysisEngine_impl. Any chance you
could simplify your scenario and attach it to a Jira issue against UIMA?

On Wed, Nov 9, 2016 at 1:24 PM, nelson rivera 
wrote:

> No, with only one instance the behavior is correct, and all required
> child CASes are generated.
>
> 2016-11-09 9:40 GMT-05:00, Eddie Epstein :
> > Is behavior the same for single-threaded AnalysisEngine instantiation?
> >
> > On Tue, Nov 8, 2016 at 10:00 AM, nelson rivera  >
> > wrote:
> >
> >> I have a aggregate analysis engine that contains a casmultiplier
> >> annotator. I instantiate this aggregate with the interface
> >> UIMAFramework.produceAnalysisEngine(specifier, 1, 0) for multithreaded
> >> processing. The casmultiplier generate more than one cas for each
> >> input CAS. The issue is that after first cas child, that i get with
> >>
> >>  JCasIterator casIterator =
> >> analysisEngine.processAndOutputNewCASes(jcas);
> >> while (casIterator.hasNext()) {
> >>JCas outCas = casIterator.next();
> >>...
> >> outCas.release();
> >> }
> >>
> >> after this first cas child, the MultiprocessingAnalysisEngine_impl
> >> assumes that the instance of
> >> AggregateAnalysisEngine that processes the request has ended, and that
> >> this instance is then free to process another request from another
> >> thread, which is not true, because child CASes are still missing, producing
> >> concurrency errors.
> >>
> >> What is the condition of a instance of MultiprocessingAnalysisEngine
> >> that contains cas multiplier that generate many cas child for each
> >> input Cas, for determine that it finish and is free?
> >>
> >
>


Re: Free instance of aggregate with cas multiplier in MultiprocessingAnalysisEngine

2016-11-09 Thread Eddie Epstein
Is behavior the same for single-threaded AnalysisEngine instantiation?

On Tue, Nov 8, 2016 at 10:00 AM, nelson rivera 
wrote:

> I have a aggregate analysis engine that contains a casmultiplier
> annotator. I instantiate this aggregate with the interface
> UIMAFramework.produceAnalysisEngine(specifier, 1, 0) for multithreaded
> processing. The casmultiplier generate more than one cas for each
> input CAS. The issue is that after first cas child, that i get with
>
>  JCasIterator casIterator = analysisEngine.processAndOutputNewCASes(jcas);
> while (casIterator.hasNext()) {
>JCas outCas = casIterator.next();
>...
> outCas.release();
> }
>
> after this first cas child, the MultiprocessingAnalysisEngine_impl
> assumes that the instance of
> AggregateAnalysisEngine that processes the request has ended, and that
> this instance is then free to process another request from another
> thread, which is not true, because child CASes are still missing, producing
> concurrency errors.
>
> What is the condition of a instance of MultiprocessingAnalysisEngine
> that contains cas multiplier that generate many cas child for each
> input Cas, for determine that it finish and is free?
>


Re: java.lang.ClassCastException with binary SerializationStrategy

2016-11-03 Thread Eddie Epstein
Is a collection reader being plugged into the UimaAsynchronousEngine? If so
does its component descriptor define or import any types? Sorry to say,
given that Xmi works, the most likely problem is still a type system mismatch.

Eddie

On Thu, Nov 3, 2016 at 5:37 PM, nelson rivera 
wrote:

> Yes with xmiCas serialization everything works fine. The client and
> the input Cas have identical type system definitions, because i get
> the cas from  UimaAsynchronousEngine with the line
> "asynchronousEngine.getCAS()", any idea of problem
>
> 2016-11-03 16:49 GMT-04:00, Eddie Epstein :
> > Hi,
> >
> > Binary serialization for a service call only works if the client and
> > service have identical type system definitions. Have you confirmed
> > everything works with the default XmiCas serialization?
> >
> > Eddie
> >
> > On Thu, Nov 3, 2016 at 3:51 PM, nelson rivera 
> > wrote:
> >
> >> I want to consume a service uima-as aggregate, the service have all
> >> delegates co-located, with format binary for serialization, I set
> >> SerializationStrategy as "binary" in the cliente side to the
> >> application context map used to pass initialization parameters. But
> >> when process i get this exception in te service uima-as:
> >>
> >>
> >> 01:42:00.679 - 14:
> >> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.
> >> handleProcessRequestFromRemoteClient:
> >> WARNING:
> >> java.lang.ClassCastException: org.apache.uima.cas.impl.AnnotationImpl
> >> cannot be cast to org.apache.uima.cas.SofaFS
> >> at org.apache.uima.cas.impl.CASImpl.getSofa(CASImpl.java:834)
> >> at
> >> org.apache.uima.cas.impl.FSIndexRepositoryImpl.ll_addFS_common(
> >> FSIndexRepositoryImpl.java:2786)
> >> at org.apache.uima.cas.impl.FSIndexRepositoryImpl.ll_
> >> addFS(FSIndexRepositoryImpl.java:2763)
> >> at org.apache.uima.cas.impl.FSIndexRepositoryImpl.addFS(
> >> FSIndexRepositoryImpl.java:2068)
> >> at org.apache.uima.cas.impl.CASImpl.reinitIndexedFSs(
> >> CASImpl.java:1916)
> >> at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1640)
> >> at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1393)
> >> at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1386)
> >> at org.apache.uima.cas.impl.Serialization.deserializeCAS(
> >> Serialization.java:187)
> >> at org.apache.uima.aae.UimaSerializer.deserializeCasFromBinary(
> >> UimaSerializer.java:223)
> >> at org.apache.uima.aae.handler.input.ProcessRequestHandler_
> impl.
> >> deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:229)
> >> at org.apache.uima.aae.handler.input.ProcessRequestHandler_
> impl.
> >> handleProcessRequestFromRemoteClient(ProcessRequestHandler_
> impl.java:531)
> >> at org.apache.uima.aae.handler.input.ProcessRequestHandler_
> >> impl.handle(ProcessRequestHandler_impl.java:1062)
> >> at org.apache.uima.aae.handler.input.MetadataRequestHandler_
> >> impl.handle(MetadataRequestHandler_impl.java:78)
> >> at org.apache.uima.adapter.jms.activemq.JmsInputChannel.
> >> onMessage(JmsInputChannel.java:731)
> >> at
> >> org.springframework.jms.listener.AbstractMessageListenerContain
> >> er.doInvokeListener(AbstractMessageListenerContainer.java:689)
> >> at
> >> org.springframework.jms.listener.AbstractMessageListenerContain
> >> er.invokeListener(AbstractMessageListenerContainer.java:649)
> >> at
> >> org.springframework.jms.listener.AbstractMessageListenerContain
> >> er.doExecuteListener(AbstractMessageListenerContainer.java:619)
> >> at
> >> org.springframework.jms.listener.AbstractPollingMessageListener
> >> Container.doReceiveAndExecute(AbstractPollingMessageListener
> >> Container.java:307)
> >> at
> >> org.springframework.jms.listener.AbstractPollingMessageListener
> >> Container.receiveAndExecute(AbstractPollingMessageListener
> >> Container.java:245)
> >> at
> >> org.springframework.jms.listener.DefaultMessageListenerContaine
> >> r$AsyncMessageListenerInvoker.invokeListener(
> >> DefaultMessageListenerContainer.java:1144)
> >> at
> >> org.springframework.jms.listener.DefaultMessageListenerContaine
> >> r$AsyncMessageListenerInvoker.executeOngoingLoop(
> >> DefaultMessageListenerContainer.java:1136)
> >> at
> >> org.springframework.jms.listener.DefaultMessageListenerContaine
> >> r$AsyncMessageListenerInvoker.run(DefaultMessageListenerContaine
> >> r.java:1033)
> >> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> >> ThreadPoolExecutor.java:1145)
> >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> >> ThreadPoolExecutor.java:615)
> >> at org.apache.uima.aae.UimaAsThreadFactory$1.run(
> >> UimaAsThreadFactory.java:132)
> >> at java.lang.Thread.run(Thread.java:745)
> >>
> >
>


Re: java.lang.ClassCastException with binary SerializationStrategy

2016-11-03 Thread Eddie Epstein
Hi,

Binary serialization for a service call only works if the client and
service have identical type system definitions. Have you confirmed
everything works with the default XmiCas serialization?
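
For reference, the client-side selection looks roughly like this (the constant
names are from memory of the UIMA-AS client examples, so please double-check
them; the broker URL and queue name are placeholders):

  Map<String, Object> appCtx = new HashMap<>();
  appCtx.put(UimaAsynchronousEngine.ServerUri, "tcp://broker:61616");
  appCtx.put(UimaAsynchronousEngine.ENDPOINT, "myServiceQueue");
  // "xmi" (the default) tolerates type system differences between client and
  // service; "binary" requires both sides to use the identical type system.
  appCtx.put(UimaAsynchronousEngine.SerializationStrategy, "binary");
  uimaAsEngine.initialize(appCtx);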

Eddie

On Thu, Nov 3, 2016 at 3:51 PM, nelson rivera 
wrote:

> I want to consume a service uima-as aggregate, the service have all
> delegates co-located, with format binary for serialization, I set
> SerializationStrategy as "binary" in the cliente side to the
> application context map used to pass initialization parameters. But
> when process i get this exception in te service uima-as:
>
>
> 01:42:00.679 - 14:
> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.
> handleProcessRequestFromRemoteClient:
> WARNING:
> java.lang.ClassCastException: org.apache.uima.cas.impl.AnnotationImpl
> cannot be cast to org.apache.uima.cas.SofaFS
> at org.apache.uima.cas.impl.CASImpl.getSofa(CASImpl.java:834)
> at org.apache.uima.cas.impl.FSIndexRepositoryImpl.ll_addFS_common(
> FSIndexRepositoryImpl.java:2786)
> at org.apache.uima.cas.impl.FSIndexRepositoryImpl.ll_
> addFS(FSIndexRepositoryImpl.java:2763)
> at org.apache.uima.cas.impl.FSIndexRepositoryImpl.addFS(
> FSIndexRepositoryImpl.java:2068)
> at org.apache.uima.cas.impl.CASImpl.reinitIndexedFSs(
> CASImpl.java:1916)
> at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1640)
> at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1393)
> at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1386)
> at org.apache.uima.cas.impl.Serialization.deserializeCAS(
> Serialization.java:187)
> at org.apache.uima.aae.UimaSerializer.deserializeCasFromBinary(
> UimaSerializer.java:223)
> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.
> deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:229)
> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.
> handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:531)
> at org.apache.uima.aae.handler.input.ProcessRequestHandler_
> impl.handle(ProcessRequestHandler_impl.java:1062)
> at org.apache.uima.aae.handler.input.MetadataRequestHandler_
> impl.handle(MetadataRequestHandler_impl.java:78)
> at org.apache.uima.adapter.jms.activemq.JmsInputChannel.
> onMessage(JmsInputChannel.java:731)
> at org.springframework.jms.listener.AbstractMessageListenerContain
> er.doInvokeListener(AbstractMessageListenerContainer.java:689)
> at org.springframework.jms.listener.AbstractMessageListenerContain
> er.invokeListener(AbstractMessageListenerContainer.java:649)
> at org.springframework.jms.listener.AbstractMessageListenerContain
> er.doExecuteListener(AbstractMessageListenerContainer.java:619)
> at org.springframework.jms.listener.AbstractPollingMessageListener
> Container.doReceiveAndExecute(AbstractPollingMessageListener
> Container.java:307)
> at org.springframework.jms.listener.AbstractPollingMessageListener
> Container.receiveAndExecute(AbstractPollingMessageListener
> Container.java:245)
> at org.springframework.jms.listener.DefaultMessageListenerContaine
> r$AsyncMessageListenerInvoker.invokeListener(
> DefaultMessageListenerContainer.java:1144)
> at org.springframework.jms.listener.DefaultMessageListenerContaine
> r$AsyncMessageListenerInvoker.executeOngoingLoop(
> DefaultMessageListenerContainer.java:1136)
> at org.springframework.jms.listener.DefaultMessageListenerContaine
> r$AsyncMessageListenerInvoker.run(DefaultMessageListenerContaine
> r.java:1033)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:615)
> at org.apache.uima.aae.UimaAsThreadFactory$1.run(
> UimaAsThreadFactory.java:132)
> at java.lang.Thread.run(Thread.java:745)
>


Re: UIMA DUCC limit max memory of node

2016-11-01 Thread Eddie Epstein
Hi,

You are right that ducc.agent.node.metrics.fake.memory.size will override
the agent's computation of total usable memory. This must be set as a Java
property on the agent. To set this for all agents, add the following line
to site.ducc.properties in the resources folder and restart DUCC:

  ducc.agent.jvm.args = -Xmx500M -Dducc.agent.node.metrics.fake.memory.size=N

where N is in KB.

DUCC uses cgset -r cpu.shares=M to control a container's CPU. M is computed as

  M = container-size-in-bytes / total-memory-size-in-KB

so the maximum value for M in a DUCC container is 1024.
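
For example, on a node with 64GB of usable memory (67,108,864 KB), a 32GB
allocation would get cpu.shares = 34,359,738,368 bytes / 67,108,864 KB = 512,
i.e. half of the 1024 maximum.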

cpu.shares controls CPU usage in a relative way. A container with
cpu.shares=1024 will potentially get twice the CPU of a container with 512
shares. Note that if a container is using less than its share, other
containers will be allowed to get more than their share.

For newer OS, e.g. RHEL7, processes not put into a specific container are
put into the default container with cpu.shares = 1024. So if you break a
machine in half using fake.memory, and if DUCC were to fill its half up
with work, then the two halves of the box would have equal shares. Sounds
good for your scenario.

However, note that cpu.shares works for CPUs, not cores, so things may not
be so nice if hyperthreading is enabled. For example, consider a machine
with 32 cores and 2-way hyperthreading. A process burning 32 CPUs may
pretty much "max out" the machine even though there are 32 unused CPUs
available. To limit the DUCC half of a machine to only half the real
machine resources would require changing agent code to use the "*Ceiling
Enforcement Tunable Parameters*" which are absolute.

Eddie


On Tue, Nov 1, 2016 at 9:31 AM, Daniel Baumartz <
bauma...@stud.uni-frankfurt.de> wrote:

> Hi Eddie,
>
> ok, I will try to explain with more detail, maybe this is not how ducc is
> being used normally. We want to set up some nodes which are not exclusively
> used for ducc. For example, one of the nodes may have 100 GB, but we want
> the usable memory for ducc to only be 50 GB, not all free memory. (We also
> want to limit the CPU usage, for example only use 32 of 64 cores, but we
> have not tried to set this up yet.)
>
> We could not find any setting to achieve this behavior, so we tried using
> cgroups to limit the max usable memory for ducc. This did not work because
> ducc gets its memory info from /proc/meminfo which ignores the cgroups
> settings. After reading through the code it seems only setting
> "ducc.agent.node.metrics.fake.memory.size" (not setting up test mode) is
> doing something similar to what we want: "Comment from
> NodeMemInfoCollector.java: if running ducc in simulation mode skip memory
> adjustment. Report free memory = fakeMemorySize". But I am not sure if we
> can use this safely since it is for testing.
>
> So we basically want to give ducc an upper limit of usable memory.
>
> I hope it is a bit more clear what we want to achieve.
>
> Thanks again,
> Daniel
>
>
> Zitat von Eddie Epstein :
>
>
> Hi Daniel,
>>
>> For each node Ducc sums RSS for all "system" user processes and excludes
>> that from Ducc usable memory on the node. System users are defined by a
>> ducc.properties setting with default value:
>> ducc.agent.node.metrics.sys.gid.max = 500
>>
>> Ducc's simulation mode is intended for creating a scaled out cluster of
>> fake nodes for testing purposes.
>>
>> The only mechanism available for reserving additional memory is to have
>> Ducc run some dummy process that stays up forever. This could be a Ducc
>> service that is automatically started when Ducc starts. This could get
>> complicated for a heterogeneous set of machines and/or Ducc classes.
>>
>> Can you be more precise of what features you are looking for limiting
>> resource use of Ducc machines?
>>
>> Thanks,
>> Eddie
>>
>>
>> On Mon, Oct 31, 2016 at 10:03 AM, Daniel Baumartz <
>> bauma...@stud.uni-frankfurt.de> wrote:
>>
>> Hi,
>>>
>>> I am trying to set up nodes for Ducc that should not use all the memory
>>> on
>>> the machine. I tried to limit the memory with cgroups, but it seems Ducc
>>> is
>>> getting the memory info from /proc/meminfo which ignores the cgroups
>>> settings.
>>>
>>> Did I miss an option to specify the max usable memory? Could I safely use
>>> "ducc.agent.node.metrics.fake.memory.size" from the simulation settings?
>>> Or is there a better way to do this?
>>>
>>> Thanks,
>>> Daniel
>>>
>>>
>>>
>
>
>


Re: UIMA DUCC limit max memory of node

2016-10-31 Thread Eddie Epstein
Hi Daniel,

For each node Ducc sums RSS for all "system" user processes and excludes
that from Ducc usable memory on the node. System users are defined by a
ducc.properties setting with default value:
ducc.agent.node.metrics.sys.gid.max = 500

Ducc's simulation mode is intended for creating a scaled out cluster of
fake nodes for testing purposes.

The only mechanism available for reserving additional memory is to have
Ducc run some dummy process that stays up forever. This could be a Ducc
service that is automatically started when Ducc starts. This could get
complicated for a heterogeneous set of machines and/or Ducc classes.

Can you be more precise of what features you are looking for limiting
resource use of Ducc machines?

Thanks,
Eddie


On Mon, Oct 31, 2016 at 10:03 AM, Daniel Baumartz <
bauma...@stud.uni-frankfurt.de> wrote:

> Hi,
>
> I am trying to set up nodes for Ducc that should not use all the memory on
> the machine. I tried to limit the memory with cgroups, but it seems Ducc is
> getting the memory info from /proc/meminfo which ignores the cgroups
> settings.
>
> Did I miss an option to specify the max usable memory? Could I safely use
> "ducc.agent.node.metrics.fake.memory.size" from the simulation settings?
> Or is there a better way to do this?
>
> Thanks,
> Daniel
>
>


Re: Uima Ducc Service restart on timeout

2016-10-29 Thread Eddie Epstein
Hi Wahed,

One approach would be to configure the service itself to self-destruct if
processing a CAS exceeds a time threshold. UIMA-AS error configuration does
support timeouts for remote delegates, but not for in-process delegates. So
this would require starting a timer thread in the annotator that would call
exit() if not reset at the end of the process() method. DUCC will
automatically attempt to restart a service instance that exits.
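A minimal sketch of that timer-thread idea (the class name, the 60-second limit
and doRealWork() are just placeholders):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;

public class SelfTerminatingAnnotator extends JCasAnnotator_ImplBase {

  // one watchdog thread per annotator instance
  private final ScheduledExecutorService watchdog =
      Executors.newSingleThreadScheduledExecutor();

  // assumed per-CAS processing limit
  private static final long LIMIT_SECONDS = 60;

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    // arm: if this CAS is still being processed after LIMIT_SECONDS,
    // exit the JVM; DUCC will restart the service instance
    ScheduledFuture<?> bomb =
        watchdog.schedule(() -> System.exit(1), LIMIT_SECONDS, TimeUnit.SECONDS);
    try {
      doRealWork(jcas);      // the wrapped analytic goes here
    } finally {
      bomb.cancel(false);    // disarm at the end of process()
    }
  }

  private void doRealWork(JCas jcas) {
    // placeholder for the actual analysis
  }
}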

DUCC's pinger API allows a service pinger to detect that a service instance
is not working correctly and tell DUCC to restart the instance. This
approach has been confirmed to work for the current trunk code, after post
v2.1.0 fixes.

Eddie


On Fri, Oct 28, 2016 at 10:07 AM, Wahed Hemati 
wrote:

> Hi,
>
> is there a mechanism in Ducc to restart a service, if it is processing a
> CAS for too long?
>
> I have an annotator running as a primitive service on Ducc, which sometimes
> gets into an endless loop. I call this service with a UIMA AS Client. I can
> set a timeout on the UIMA AS Client, and this works. The client throws a
> timeout after the specified period, which is nice. However the service is
> still processing the CAS somehow. Can I tell Ducc to shut down and restart a
> service, if processing a CAS takes more than a specified time period
> (e.g. 60 sec)?
>
> Thanks in advance
>
> -Wahed
>
>
>


Re: C++/Python annotators in Eclipse on Mac OS

2016-05-06 Thread Eddie Epstein
Hi Sean,

There are example .mak files for compiling and creating shared libraries
from C++ annotator code. A couple of env parameters need to be set for the
build. It should be straightforward to configure Eclipse CDT to build an
annotator and a C++ application calling annotators from a makefile.

Python annotators sit on top of a C++ annotator. No idea about Eclipse
support for that kind of thing.

Eddie


On Thu, May 5, 2016 at 3:44 PM, Sean Crist  wrote:

>
> Hello,
>
> I’m trying to set up the ability to write annotators in C++ and in Python
> using Eclipse on Mac OS X.
>
> I read the following two sources:
>
> https://uima.apache.org/doc-uimacpp-huh.html
>
> Also the README file in the download of UIMACPP
>
> Both documents seem geared for using UIMA from the command line in Windows
> or Linux.  It wasn’t immediately evident how to translate those
> instructions to my situation.  There were a few passing mentions of Eclipse
> or Mac OS, but nothing like a step-by-step.
>
> Is there a writeup on this that I’ve missed in my Google search?  Absent
> that, any pointers or suggestions on how to proceed?
>
> Thanks,
> —Sean Crist
>
>
>


Re: UIMACPP and multi-threading

2016-04-28 Thread Eddie Epstein
Benjamin,

Initial testing with the latest AMQ broker indicates an incompatibility
with the existing UIMACPP release. Along with the problems you have exposed
there is good motivation to get another uimacpp release out relatively
soon. Thanks for exposing the GC/threading issue with the JNI and potential
fixes.

Eddie

On Tue, Apr 26, 2016 at 3:47 AM, Benjamin De Boe <
benjamin.de...@intersystems.com> wrote:

> Hi Eddie,
>
> I'm not familiar with the serializeJNI issue.
> Few sources still recommend implementing finalize(), because it is
> undetermined in which order the GC process will eventually invoke them. We
> also thought it was counterintuitive to see the UimacppEngine being
> finalized before the UimacppAnalysisComponent that wraps it, but that's
> what our extra logs quite consistently seemed to indicate, so that's
> probably just what the word "non-deterministic" means.
>
> This article suggests a few alternatives that may be considered for this
> UIMACPP / JNI issue in the long run:
> http://www.oracle.com/technetwork/java/javamail/finalization-137655.html
>
>
> Thanks,
> benjamin
>
> --
> Benjamin De Boe | Product Manager
> M: +32 495 19 19 27 | T: +32 2 464 97 33
> InterSystems Corporation | http://www.intersystems.com
>
> -Original Message-
> From: Eddie Epstein [mailto:eaepst...@gmail.com]
> Sent: Tuesday, April 26, 2016 4:58 AM
> To: user@uima.apache.org
> Cc: Jos Denys ; Chen-Chieh Hsu <
> chen-chieh@intersystems.com>
> Subject: Re: UIMACPP and multi-threading
>
> Hi,
>
> Not the author of the JNI, but does it make sense that
> UimacppEngine.finalize() could be called while UimacppAnalysisComponent
> maintains a valid engine pointer to UimacppEngine? And once the engine
> pointer has been set to null, UimacppAnalysisComponent.destroy() will not
> call UimacppEngine.destroy(). Leaves me confused how this could happen.
>
> At any rate, do you think finalize is related to the serializeJNI problem?
>
> Eddie
>
>
>
>
>
> On Mon, Apr 25, 2016 at 8:27 AM, Benjamin De Boe <
> benjamin.de...@intersystems.com> wrote:
>
> > After some more debugging, it seems this is probably a garbage
> > collection issue rather than a multi-threading issue, although
> > multiple threads may well increase the likelihood of it happening.
> >
> > We've found that there are two methods on the CPP side for cleaning up
> > the memory used by the CPP engine: destroyJNI() and destructorJNI().
> > destructorJNI() is called from the UimacppEngine:finalize() method and
> > only deletes the pInstance pointer, whereas destroyJNI() does a lot
> > more work in cleaning up what lies beyond and is called through
> > UimacppEngine:destroy(), which in turn is invoked from
> UimacppAnalysisComponent:finalize().
> >
> > Now, the arcane magic in the GC process seems to first finish off the
> > UimacppEngine helper object (calling destructorJNI()) and then the
> > UimacppAnalysisComponent instance that contained the other one, with
> > its
> > destroyJNI() method then running into trouble because pInstance was
> > already deleted in destructorJNI(), causing the access violation we've
> > been struggling with.
> >
> > [logged as https://issues.apache.org/jira/browse/UIMA-4899 ]
> >
> > There are a number of ways how we could work around this (such as just
> > calling destroyJNI() in both cases, exiting early if it's already
> > cleaned up), but of course we'd hope someone of the original UIMACPP
> > team to weigh in and share the reasoning behind those two separate
> > methods and anything we're overlooking in our assessment. Anybody who
> > can recommend what we should do in the short run and how this might
> > translate into a fixed UIMA / UIMACPP release at some point? An
> > out-of-the-box 64-bit UIMACPP release would probably benefit more than
> > just us (cf https://issues.apache.org/jira/browse/UIMA-4900).
> >
> >
> >
> > Thanks,
> > benjamin
> >
> > --
> > Benjamin De Boe | Product Manager
> > M: +32 495 19 19 27 | T: +32 2 464 97 33 InterSystems Corporation |
> > http://www.intersystems.com
> >
> > -Original Message-
> > From: Eddie Epstein [mailto:eaepst...@gmail.com]
> > Sent: Thursday, April 7, 2016 1:58 PM
> > To: user@uima.apache.org
> > Subject: Re: UIMACPP and multi-threading
> >
> > Standalone.java certainly does show threading issues with uimacpp's JNI.
> > The multithread testing thru the JNI, like the one I did a few days
> > ago, was clearly not sufficient to declare it thread

Re: UIMACPP and multi-threading

2016-04-25 Thread Eddie Epstein
Hi,

Not the author of the JNI, but does it make sense that
UimacppEngine.finalize() could be called while UimacppAnalysisComponent
maintains a valid engine pointer to UimacppEngine? And once the engine
pointer has been set to null, UimacppAnalysisComponent.destroy() will not
call UimacppEngine.destroy(). Leaves me confused how this could happen.

At any rate, do you think finalize is related to the serializeJNI problem?

Eddie





On Mon, Apr 25, 2016 at 8:27 AM, Benjamin De Boe <
benjamin.de...@intersystems.com> wrote:

> After some more debugging, it seems this is probably a garbage collection
> issue rather than a multi-threading issue, although multiple threads may
> well increase the likelihood of it happening.
>
> We've found that there are two methods on the CPP side for cleaning up the
> memory used by the CPP engine: destroyJNI() and destructorJNI().
> destructorJNI() is called from the UimacppEngine:finalize() method and only
> deletes the pInstance pointer, whereas destroyJNI() does a lot more work in
> cleaning up what lies beyond and is called through UimacppEngine:destroy(),
> which in turn is invoked from UimacppAnalysisComponent:finalize().
>
> Now, the arcane magic in the GC process seems to first finish off the
> UimacppEngine helper object (calling destructorJNI()) and then the
> UimacppAnalysisComponent instance that contained the other one, with its
> destroyJNI() method then running into trouble because pInstance was already
> deleted in destructorJNI(), causing the access violation we've been
> struggling with.
>
> [logged as https://issues.apache.org/jira/browse/UIMA-4899 ]
>
> There are a number of ways how we could work around this (such as just
> calling destroyJNI() in both cases, exiting early if it's already cleaned
> up), but of course we'd hope someone of the original UIMACPP team to weigh
> in and share the reasoning behind those two separate methods and anything
> we're overlooking in our assessment. Anybody who can recommend what we
> should do in the short run and how this might translate into a fixed UIMA /
> UIMACPP release at some point? An out-of-the-box 64-bit UIMACPP release
> would probably benefit more than just us (cf
> https://issues.apache.org/jira/browse/UIMA-4900).
>
>
>
> Thanks,
> benjamin
>
> --
> Benjamin De Boe | Product Manager
> M: +32 495 19 19 27 | T: +32 2 464 97 33
> InterSystems Corporation | http://www.intersystems.com
>
> -Original Message-
> From: Eddie Epstein [mailto:eaepst...@gmail.com]
> Sent: Thursday, April 7, 2016 1:58 PM
> To: user@uima.apache.org
> Subject: Re: UIMACPP and multi-threading
>
> Standalone.java certainly does show threading issues with uimacpp's JNI.
> The multithread testing thru the JNI, like the one I did a few days ago,
> was clearly not sufficient to declare it thread safe.
>
> Our local uimacpp development with regards thread safety was focused on
> multithread testing for the development of uimacpp's native AMQ service
> wrapper.
>
> If you do fix the JNI threading issues please consider contributing them
> back to ASF!
> Eddie
>
> On Tue, Apr 5, 2016 at 8:54 AM, Jos Denys 
> wrote:
>
> > Hi Eddie,
> >
> > I worked on the CPP-side, and what I noticed was that the JNI
> > Interface always passes an instance pointer :
> >
> > JNIEXPORT void JNICALL JAVA_PREFIX(resetJNI) (JNIEnv* jeEnv, jobject
> > joJTaf) {
> >   try {
> > UIMA_TPRINT("entering resetDocument()");
> >
> > uima::JNIInstance* pInstance = JNIUtils::getCppInstance(jeEnv,
> > joJTaf);
> >
> >
> > Now the strange thing, and finally what caused the acces violation
> > error, was that the pInstance pointer was the same for the 3 threads
> > that
> > (simultaneously) did the UIMA processing, so it looks like the same
> > CAS was passed for 3 different analysis worker threads.
> >
> > Any idea why and how this can happen ?
> >
> > Thanks for your feedback,
> > Jos Denys,
> > InterSystems Benelux.
> >
> >
> > De : Benjamin De Boe
> > Envoyé : mardi 5 avril 2016 09:33
> > À : user@uima.apache.org
> > Cc : Jos Denys ; Chen-Chieh Hsu <
> > chen-chieh@intersystems.com> Objet : RE: UIMACPP and
> > multi-threading
> >
> >
> > Hi Eddie,
> >
> >
> >
> > Thanks for your prompt response.
> >
> > In our experiment, we have one initial thread instantiating a CasPool
> > and then passing it on to newly spawned threads that each have their
> > own DaveDetector instance and fetch a new CAS from the shared pool.
> > The UimacppEngine objects'

Re: UIMACPP and multi-threading

2016-04-07 Thread Eddie Epstein
ier), (casPoolSize > 0) ? pool : null);
>
> Thread t = new Thread(task);
> t.start();
> }
> }
>
> @Override
> public void run() {
> CAS cas = null;
> try {
> if (pool != null) {
> cas = pool.getCas();
> } else {
> cas = CasCreationUtils.createCas(ae.getAnalysisEngineMetaData());
> }
>
> cas.setDocumentText(text);
> ae.process(cas);
>
> System.out.println("Done processing text");
>
> } catch (Exception e) {
> e.printStackTrace();
> } finally {
> if (pool != null) pool.releaseCas(cas);
> }
> }
> }
>
>
>
>
>
> Probably also of note: we sometimes get a simple exception on destroyJNI()
> (pasted below), rather than the outright total process crash described
> earlier. We assume this is just “luck” in that the different threads are
> invoking a not-so-critical section.
>
>
>
> Apr 05, 2016 9:25:25 AM org.apache.uima.uimacpp.UimacppAnalysisComponent
> logJTafException
>
> SEVERE: The following internal exception was caught: 5,002
> (UIMA_ERR_ENGINE_UNEXPECTED_EXCEPTION)
>
> Apr 05, 2016 9:25:25 AM org.apache.uima.uimacpp.UimacppAnalysisComponent
> logJTafException(431)
>
> SEVERE:
>
> Error number  : 5002
>
> Recoverable   : No
>
> Error : Unexpected error
>
> (5002)
>
> org.apache.uima.uimacpp.InternalTafException:
>
> Error number  : 5002
>
> Recoverable   : No
>
> Error : Unexpected error
>
> (5002)
>
> at org.apache.uima.uimacpp.UimacppEngine.destroyJNI(Native Method)
>
> at
> org.apache.uima.uimacpp.UimacppEngine.destroy(UimacppEngine.java:304)
>
> at
> org.apache.uima.uimacpp.UimacppAnalysisComponent.destroy(UimacppAnalysisComponent.java:338)
>
> at
> org.apache.uima.uimacpp.UimacppAnalysisComponent.finalize(UimacppAnalysisComponent.java:354)
>
> at java.lang.System$2.invokeFinalize(System.java:1270)
>
> at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:98)
>
> at java.lang.ref.Finalizer.access$100(Finalizer.java:34)
>
> at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:210)
>
>
>
>
>
>
>
> Many thanks for your feedback,
>
>
>
> benjamin
>
>
>
>
>
> --
>
> Benjamin De Boe | Product Manager
>
> M: +32 495 19 19 27 | T: +32 2 464 97 33
>
> InterSystems Corporation | http://www.intersystems.com
>
>
>
> -Original Message-
>
> From: Eddie Epstein [mailto:eaepst...@gmail.com]
>
> Sent: Tuesday, April 5, 2016 12:47 AM
>
> To: user@uima.apache.org<mailto:user@uima.apache.org>
>
> Subject: Re: UIMACPP and multi-threading
>
>
>
> Hi Benjamin,
>
>
>
> UIMACPP is thread safe, as is the JNI interface. To confirm, I just
> created a UIMA-AS service with 10 instances of DaveDetector, and fed the
> service
>
> 800 CASes with up to 10 concurrent CASes at any time.
>
>
>
> It is not the case with DaveDetector, but at annotator initialization some
> analytics will store info in thread local storage, and expect the same
> thread be used to call the annotator process method. UIMA-AS and DUCC
> guarantee that an instantiated AE is always called on the same thread.
>
>
>
> Eddie
>
>
>
>
>
>
>
> On Mon, Apr 4, 2016 at 10:56 AM, Benjamin De Boe <
> benjamin.de...@intersystems.com<mailto:benjamin.de...@intersystems.com>>
> wrote:
>
>
>
> > Hi,
>
> >
>
> > We're working with a UIMACPP annotator (wrapping our existing NLP
>
> > library) and are running in what appears to be thread safety issues,
>
> > which we can reproduce with the DaveDetector demo AE.
>
> > When separate threads are accessing separate instances of the
>
> > org.apache.uima.uimacpp.UimacppAnalysisComponent wrapper class on the
>
> > Java side, it appears they are invoking the same object on the C++
>
> > side, which results in quite a mess (access violations and process
>
> > crashes) when different threads concurrently invoke resetJNI() and
>
> > fillCASJNI() on the org.apache.uima.uimacpp.UimacppAnalysisComponent
>
> > object. When using a small CAS pool on the Java side, the problem does
>
> > not seem to occur, but it resurfaces if the CAS pool grows bigger and
>
> > memory settings are not increased accordingly. However, if this were a
>
> > pure memory issue, we had hoped to see more telling errors and just
>
> > guessing how big memory should be for larger deployments isn't very
> appealing an option either.
>
> > Adding the synchronized keyword to the relevant method of the wrapper
>
> > class on the Java side also avoids the issue, at the obvious cost of
>
> > performance. Moving to UIMA-AS is not an option for us, currently.
>
> >
>
> > Given that the documentation is not explicit about it, we're hoping to
>
> > get an unambiguous answer from this list: is UIMACPP actually supposed
>
> > to be thread-safe? We saw old and resolved JIRA's that addressed
>
> > thread-safety issues for UIMACPP, so we assumed it was the case, but
>
> > reality seems to point in the opposite direction.
>
> >
>
> >
>
> > Thanks in advance for your feedback,
>
> >
>
> > benjamin
>
> >
>
> >
>
> > --
>
> > Benjamin De Boe | Product Manager
>
> > M: +32 495 19 19 27 | T: +32 2 464 97 33 InterSystems Corporation |
>
> > http://www.intersystems.com
>
> >
>
> >
>


Re: UIMACPP and multi-threading

2016-04-04 Thread Eddie Epstein
Hi Benjamin,

UIMACPP is thread safe, as is the JNI interface. To confirm, I just created
a UIMA-AS service with 10 instances of DaveDetector, and fed the service
800 CASes with up to 10 concurrent CASes at any time.

It is not the case with DaveDetector, but at annotator initialization some
analytics will store info in thread-local storage, and expect the same
thread to be used to call the annotator's process method. UIMA-AS and DUCC
guarantee that an instantiated AE is always called on the same thread.
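To make that pattern concrete, a small sketch (the annotator name and the
native-handle resource are made up):

import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;

public class ThreadAffineAnnotator extends JCasAnnotator_ImplBase {

  // hypothetical non-thread-safe resource kept in thread-local storage
  private final ThreadLocal<Long> nativeHandle = new ThreadLocal<>();

  @Override
  public void initialize(UimaContext ctx) throws ResourceInitializationException {
    super.initialize(ctx);
    // stored for the thread that ran initialize(); only usable if process()
    // is later called on that same thread, as UIMA-AS and DUCC guarantee
    nativeHandle.set(openNativeResource());
  }

  @Override
  public void process(JCas jcas) {
    Long handle = nativeHandle.get(); // would be null on any other thread
    // ... use handle to analyze jcas ...
  }

  private long openNativeResource() {
    return 42L; // placeholder
  }
}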

Eddie



On Mon, Apr 4, 2016 at 10:56 AM, Benjamin De Boe <
benjamin.de...@intersystems.com> wrote:

> Hi,
>
> We're working with a UIMACPP annotator (wrapping our existing NLP library)
> and are running into what appear to be thread safety issues, which we can
> reproduce with the DaveDetector demo AE.
> When separate threads are accessing separate instances of the
> org.apache.uima.uimacpp.UimacppAnalysisComponent wrapper class on the Java
> side, it appears they are invoking the same object on the C++ side, which
> results in quite a mess (access violations and process crashes) when
> different threads concurrently invoke resetJNI() and fillCASJNI() on the
> org.apache.uima.uimacpp.UimacppAnalysisComponent object. When using a small
> CAS pool on the Java side, the problem does not seem to occur, but it
> resurfaces if the CAS pool grows bigger and memory settings are not
> increased accordingly. However, if this were a pure memory issue, we had
> hoped to see more telling errors and just guessing how big memory should be
> for larger deployments isn't very appealing an option either.
> Adding the synchronized keyword to the relevant method of the wrapper
> class on the Java side also avoids the issue, at the obvious cost of
> performance. Moving to UIMA-AS is not an option for us, currently.
>
> Given that the documentation is not explicit about it, we're hoping to get
> an unambiguous answer from this list: is UIMACPP actually supposed to be
> thread-safe? We saw old and resolved JIRA's that addressed thread-safety
> issues for UIMACPP, so we assumed it was the case, but reality seems to
> point in the opposite direction.
>
>
> Thanks in advance for your feedback,
>
> benjamin
>
>
> --
> Benjamin De Boe | Product Manager
> M: +32 495 19 19 27 | T: +32 2 464 97 33
> InterSystems Corporation | http://www.intersystems.com
>
>


Re: DUCC: Unable to do "Fixed" type of Reservation

2016-03-31 Thread Eddie Epstein
Hi Reshu,

Reserve type allows users to allocate an unconstrained resource. Because
reserve allocations are not constrained by cgroup containers, in v2.x these
allocations were restricted to be an entire machine.

Fixed type allocations, which are always associated with a specific user
process, have CPU and memory constrained by cgroups, if cgroups are enabled
and properly configured. If DUCC does not recognize cgroup support for a
node it falls back to monitoring memory use and killing processes that
exceed the specified threshold above requested allocation size. CGroup
status for each node is shown on the System->Machines page.

Ubuntu locates the cgroup folder differently from Red Hat and Suse OS. DUCC
v2 does have a property to specify this location, but you have found
another bug, this time hopefully only in the documentation.

The default value for this property is:
ducc.agent.launcher.cgroups.basedir=/cgroup/ducc
To override, put a different entry in
{ducc_runtime}/resource/site.ducc.properties
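For example, on an Ubuntu node where the ducc cgroup ends up under /sys/fs/cgroup
(the exact path is just an illustration, check your own mount point), the entry
might look like:
ducc.agent.launcher.cgroups.basedir=/sys/fs/cgroup/ducc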

Regards,
Eddie


On Thu, Mar 31, 2016 at 5:48 AM, reshu.agarwal 
wrote:

> Hi,
>
> In DUCC 1.x, we are able to do fixed reservation of some of the memory of
> Nodes but We are restricted to do "reserve" type of reservation in DUCC
> 2.x. I want to know the reason for the same.
>
> I am using ubuntu for DUCC installation and not be able to configure
> c-groups in it, So, I have tried to manage RAM utilization through FIXED
> reservation in DUCC 1.x. But, Now I have no option.
>
> Hope, you can solve my problem.
>
> Cheers.
>
> Reshu.
>


Re: DUCC 2.0.1 : JP Http Client Unable to Communicate with JD

2016-01-12 Thread Eddie Epstein
Hi Reshu,

This is caused by the CollectionReader running in the JobDriver putting
character data in the work item CAS that cannot be XML serialized. DUCC
needs to do better in making this problem clear.

Two choices to fix this: 1) have the CR screen for illegal characters and
not put them in the work item CAS, or 2) assuming that the illegal
characters do not cause problems for the analytics, use the standard DUCC
job model whereby the JobDriver sends references to the raw data and
CasMultipliers in the scaled out JobProcesses create the CASes to be
processed.
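For option 1, a rough sketch of the kind of screening the CR could do (the helper
name is arbitrary); it keeps only characters that are legal in XML 1.0:

  // strip characters that cannot be XML-serialized before putting text in the CAS
  public static String stripInvalidXmlChars(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); ) {
      int cp = s.codePointAt(i);
      boolean valid = cp == 0x9 || cp == 0xA || cp == 0xD
          || (cp >= 0x20 && cp <= 0xD7FF)
          || (cp >= 0xE000 && cp <= 0xFFFD)
          || (cp >= 0x10000 && cp <= 0x10FFFF);
      if (valid) {
        sb.appendCodePoint(cp);
      }
      i += Character.charCount(cp);
    }
    return sb.toString();
  }

and then in the CollectionReader something like
  cas.setDocumentText(stripInvalidXmlChars(rawText));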

Regards,
Eddie

On Mon, Jan 11, 2016 at 11:36 PM, reshu.agarwal 
wrote:

>
> Hi,
>
> I was getting this error after 17 out of 200 documents were processed. I
> am unable to find any reason for the same. Please see the error below:
>
> INFO: Asynchronous Client Has Been Initialized. Serialization Strategy:
> [SerializationStrategy] Ready To Process.
> DuccAbstractProcessContainer.deploy()  User Container deployed
>  Deployed Processing Container - Initialization Successful - Thread 32
> DuccAbstractProcessContainer.deploy() > Deploying User Container
> ... UimaProcessContainer.doDeploy()
> 11 Jan 2016 17:18:36,969  INFO AgentSession - T[29] notifyAgentWithStatus
> ... Job Process State Changed - PID:24790. Process State: Initializing. JMX
> Url:N/A Dispatched State Update Event to Agent with IP:192.168.10.126
> DuccAbstractProcessContainer.deploy()  User Container deployed
>  Deployed Processing Container - Initialization Successful - Thread 34
> DuccAbstractProcessContainer.deploy() > Deploying User Container
> ... UimaProcessContainer.doDeploy()
> DuccAbstractProcessContainer.deploy()  User Container deployed
>  Deployed Processing Container - Initialization Successful - Thread 33
> 11 Jan 2016 17:18:38,277  INFO JobProcessComponent - T[33] setState
> Notifying Agent New State:Running
> 11 Jan 2016 17:18:38,279  INFO AgentSession - T[1] notifyAgentWithStatus
> ... Job Process State Changed - PID:24790. Process State: Running. JMX
> Url:service:jmx:rmi:///jndi/rmi://user:2106/jmxrmi Dispatched State Update
> Event to Agent with IP:192.168.10.126
> 11 Jan 2016 17:18:38,281  INFO AgentSession - T[33] notifyAgentWithStatus
> ... Job Process State Changed - PID:24790. Process State: Running. JMX
> Url:service:jmx:rmi:///jndi/rmi://user:2106/jmxrmi Dispatched State Update
> Event to Agent with IP:192.168.10.126
> 11 Jan 2016 17:18:38,281  INFO HttpWorkerThread - T[33]
> HttpWorkerThread.run()  Begin Processing Work Items - Thread Id:33
> 11 Jan 2016 17:18:38,285  INFO HttpWorkerThread - T[34]
> HttpWorkerThread.run()  Begin Processing Work Items - Thread Id:34
> 11 Jan 2016 17:18:38,285  INFO HttpWorkerThread - T[32]
> HttpWorkerThread.run()  Begin Processing Work Items - Thread Id:32
> 11 Jan 2016 17:18:38,458  INFO HttpWorkerThread - T[34] run  Thread:34
> Recv'd WI:19
> 11 Jan 2016 17:18:38,468  INFO HttpWorkerThread - T[32] run  Thread:32
> Recv'd WI:18
> 11 Jan 2016 17:18:38,478  INFO HttpWorkerThread - T[33] run  Thread:33
> Recv'd WI:21
> 11 Jan 2016 17:18:38,515 ERROR DuccHttpClient - T[33] execute  Unable to
> Communicate with JD - Error:HTTP/1.1 500  : The element type
> "org.apache.uima.ducc.container.net.impl.MetaCasTransaction" must be
> terminated by the matching end-tag
> "".
> 11 Jan 2016 17:18:38,515 ERROR DuccHttpClient - T[33] execute  Content
> causing error:[B@3c0873f9
> Thread::33 ERRR::Content causing error:[B@3c0873f9
> 11 Jan 2016 17:18:38,516 ERROR DuccHttpClient - T[33] run
> java.lang.RuntimeException: JP Http Client Unable to Communicate with JD -
> Error:HTTP/1.1 500  : The element type
> "org.apache.uima.ducc.container.net.impl.MetaCasTransaction" must be
> terminated by the matching end-tag
> "".
> at org.apache.uima.ducc.transport.configuration.jp
> .DuccHttpClient.execute(DuccHttpClient.java:226)
> at org.apache.uima.ducc.transport.configuration.jp
> .HttpWorkerThread.run(HttpWorkerThread.java:178)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at org.apache.uima.ducc.transport.configuration.jp
> .UimaServiceThreadFactory$1.run(UimaServiceThreadFactory.java:85)
> at java.lang.Thread.run(Thread.java:745)
> 11 Jan 2016 17:18:38,535 ERROR DuccHttpClient - T[33] execute  Unable to
> Communicate with JD - Error:HTTP/1.1 501 Method n>POST is not defined in
> RFC 2068 and is not supported by the Servlet API
> 11 Jan 2016 17:18:38,535 ERROR DuccHttpClient - T[33] execute  Content
> causing error:[B@12e81893
> Thread::33 ERRR::Content causing error:[B@12e81893
> 11 Jan 2016 17:18:38,535 ERROR DuccHttpClient - T[33] run
> java.lang.RuntimeExc

Re: Uima-AS Cas merger with cas multiplier

2016-01-11 Thread Eddie Epstein
Hi Hemati,

If all the components listed are delegates of a single AAE, and the AAE is
deployed by UIMA-AS as "async", then by default only a single instance of
each delegate will be instantiated. Does the UIMA-AS deployment descriptor
specify more than one instance of any of the delegates?

Regards,
Eddie


On Fri, Jan 8, 2016 at 11:22 AM, Wahed Hemati 
wrote:

> Hi,
> I am seeking for a solution for my little problem.
> I have a Uima-AS AAE with a cas multiplier. My input is a simple text
> document. The Cas multiplier splits the document into two parts for which
> the cas multiplier generates new cases. The cases are routed through the
> flow and at some point of my AAE i want to merge these cases back together
> into one cas and continue routing the merged cas through other AEs.
>
> My AAE looks like this:
>
> inputText-> AE_1-> CasMultiplier -> AE_2-> AE_3 -> CasMerger -> AE_4 ->
> AE_5-> CasConsumer
>
> My CasMultiplier is similar to
> org.apache.uima.examples.casMultiplier.SimpleTextSegmenter and my Merger is
> similar to org.apache.uima.examples.casMultiplier.SimpleTextMerger.
>
> I got the following problem:
> My AAE generates two instances of CasMerger, because the CasMultiplier
> generates two cases. How do I tell my AAE to instantiate only one CasMerger
> and route all cases generated by the CasMultiplier and processed by AE_2
> and AE_3 to that CasMerger?
>
> Kindly help.
>
> Thanks in advanced.
>
> Hemati
>
> --
> A. Wahed Hemati
> Text-Technology Lab
> Fakultät für Informatik und Mathematik
> Johann Wolfgang Goethe-Universität Frankfurt am Main
> Senckenberganlage 31
> 60325 Frankfurt am Main
> Tel: +49 69-798-28925
> Email: hem...@em.uni-frankfurt.de
> Web: http://www.hucompute.org/
>
>


Re: DUCC 1.1.0- Remain in Completing state.

2016-01-05 Thread Eddie Epstein
Hi Reshu,

Each DUCC machine has an agent responsible for starting and killing
processes.
There was a bug ( https://issues.apache.org/jira/browse/UIMA-4194 ) where
the
agent failed to issue a kill -9 against "hung" JPs when a job was stopping.
The fix is in v2.0.

Regards,
Eddie


On Tue, Jan 5, 2016 at 12:54 AM, reshu.agarwal 
wrote:

> I forgot to mention one thing, i.e. after killing the job, the next job is
> unable to initialize and remains in "WaitingForDriver" state. I have also
> checked sm.log, or.log, pm.log etc. but failed to find anything. I have to
> restart my DUCC to run jobs again.
>
> Reshu.
>
>
> On 01/05/2016 11:14 AM, reshu.agarwal wrote:
>
>> Hi,
>>
>> I am using DUCC 1.1.0 version. I am facing an issue with my job, i.e. it
>> remains in completing state even after initializing the stop process. My
>> job used two processes. And Job Driver logs:
>>
>> Jan 04, 2016 12:43:13 PM
>> org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl
>> stop
>> INFO: Stopping Asynchronous Client.
>> Jan 04, 2016 12:43:13 PM
>> org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl
>> stop
>> INFO: Asynchronous Client Has Stopped.
>> Jan 04, 2016 12:43:13 PM
>> org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl$SharedConnection
>> destroy
>> INFO: UIMA AS Client Shared Connection Has Been Closed
>> Jan 04, 2016 12:43:13 PM
>> org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngine_impl stop
>> INFO: UIMA AS Client Undeployed All Containers
>>
>> One process logs:
>>
>> Jan 04, 2016 12:44:50 PM
>> org.apache.uima.adapter.jms.activemq.JmsInputChannel stopChannel
>> INFO: Stopping Service JMS Transport. Service: ducc.jd.queue.87494
>> ShutdownNow false
>> Jan 04, 2016 12:44:50 PM
>> org.apache.uima.adapter.jms.activemq.JmsInputChannel stopChannel
>> INFO: Controller: ducc.jd.queue.87494 Stopped Listener on Endpoint:
>> queue://ducc.jd.queue.87494 Selector: Selector:Command=2000 OR Command=2002.
>>
>> But, other process do not have any log of stopping the process.
>>
>> The problem is that not all processes are completely undeployed. I have to use
>> a command to cancel the process: /ducc_install/bin$ ./ducc_cancel --id 87494
>> --dpid 4529.
>>
>> Sometimes it cancelled the process; otherwise I have to use the "kill -9"
>> command to kill the job forcefully.
>>
>> Kindly help.
>>
>> Thanks in advanced.
>>
>> Reshu.
>>
>>
>>
>


Re: UIMA-DUCC installation with multiple machines

2015-11-30 Thread Eddie Epstein
Hi,

Did you confirm that user ducc@ducc-head can do passwordless ssh to
ducc-node-1?  If so, running ./check_ducc from the admin folder should give
some useful feedback about ducc-node-1.

Eddie

On Mon, Nov 30, 2015 at 5:14 AM, Sylvain Surcin 
wrote:

> Hello,
>
> Despite experimenting for a few weeks and reading the Ducc 2.0 doc book
> again and again, I am still unable to make it run on a test cluster of 2
> machines.
>
> I have 2 VMs (ducc-head and ducc-node-1) with a NFS share for /home/ducc
> and my $DUCC_HOME is /home/ducc/ducc_runtime.
>
> I generated and copied ssh keys so that my main user on ducc-head can do a
> passwordless ssh to ducc@ducc-head (that's what I understood from the doc,
> if not, can you be more specific, with Linux commands?).
>
> I compiled ducc_runtime on ducc-head and copied it on both machines in
> /local/ducc/bin with all the appropriate chown and chmod as stated in the
> doc, for both machines. I also edited site.ducc.properties accordingly.
>
> When I launch start_ducc (as ducc on ducc-head) I see both machines on the
> Web monitor but only ducc-head is up, while ducc-node-1 always stays
> "defined". Of course, when I submit the test job, it is executed on
> ducc-head, never on ducc-node-1.
>
> What am I missing?
> I have been stuck here for weeks. Please can you help me?
>
> Regards.
>


Re: remote Analysis Engines freely available

2015-10-13 Thread Eddie Epstein
There are several remote AE samples in the UIMA-AS sdk, currently "Apache
UIMA Version 2.6.0" download link at http://uima.apache.org/downloads.cgi.

$UIMA_HOME/examples/deploy/as includes
   Deploy_MeetingDetectorTAE.xml
   Deploy_MeetingFinder.xml
   Deploy_RoomNumberAnnotator.xml

After unpacking the tarball a quick start guide is in $UIMA_HOME/README
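Assuming a broker is up (startBroker.sh in the SDK's bin directory), deploying one
of them locally is typically just something like
  deployAsyncService.sh $UIMA_HOME/examples/deploy/as/Deploy_MeetingDetectorTAE.xml
but see the README for the exact steps for your release.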

Eddie


On Tue, Oct 13, 2015 at 12:09 PM, Olivier Austina  wrote:

> Hi,Is there a remote analysis engine which is freely available. No matter
> the Analyser type. It is for demo only. Thank you.
> Regards
> Olivier
>


Re: C-Groups status remains off in web server after installing C-Groups

2015-10-06 Thread Eddie Epstein
The trunk, v2.1-snapshot, is not currently stable. There is a branch for
v2.0.1 which is.
Duccbook: the build through maven hides some error messages. There is an
alternate manual build that shows all errors; that build needs to be documented
but may not be yet.

The latest duccbook is at
http://uima.apache.org/d/uima-ducc-2.0.0/duccbook.html

Regards,
Eddie

On Tue, Oct 6, 2015 at 12:49 AM, Satya Nand Kanodia <
satya.kano...@orkash.com> wrote:

> Dear Eddie,
>
> Thank you :)
>
> It solved the issue. I was using DUCC 1.1 documentation to install
> C-Groups because I did not have v2.1 documentation .
>
> I cloned DUCC v2.1 from GitHub and tried to build it. It failed at
> executing test cases. It gave the exception below.
>
> Failed to execute goal
> org.apache.maven.plugins:maven-surefire-plugin:2.16:test (default-test) on
> project uima-ducc-orchestrator: There are test failures.
>
>
> After that I tried to build it with mvn clean install -DskipTests. It was
> a success. Then, according to the README file, for building
> documentation I executed mvn clean install -Pbuild-duccdocs -DskipTests.
> It worked, but there was no duccbook.
>
> After installing DUCC 2.1 from the binary, when I click on DuccBook on the
> webserver, I got the following error.
>
> HTTP ERROR: 404
>
> Problem accessing /doc/duccbook.html. Reason:
>
> Not Found
>
>
> Thanks and Regards,
> Satya Nand Kanodia
>
> On 10/05/2015 07:01 PM, Eddie Epstein wrote:
>
>> 0 made a small change in cgconfig.conf, adding two lines to enable
>>
>
>


Re: C-Groups status remains off in web server after installing C-Groups

2015-10-05 Thread Eddie Epstein
Satya,

DUCC v2.0 made a small change in cgconfig.conf, adding two lines to enable
CPU control:
  cpu = /cgroup;
and
  cpu {}

The contents of /cgroup on the CentOS machine are missing all of the "cpu.*"
entries.
The DUCC v2.0 agent does require cpu support to enable cgroup support.

Eddie


On Mon, Oct 5, 2015 at 12:10 AM, Satya Nand Kanodia <
satya.kano...@orkash.com> wrote:

> Dear Eddie,
>
>
> Below are the contents of /cgroup on centos machine.
>
> cgroup.event_control   memory.max_usage_in_bytes memory.oom_control
>   notify_on_release
> cgroup.procs   memory.memsw.failcnt memory.soft_limit_in_bytes
> release_agent
> ducc   memory.memsw.limit_in_bytes memory.stat
>  tasks
> memory.failcnt memory.memsw.max_usage_in_bytes memory.swappiness
> memory.force_empty memory.memsw.usage_in_bytes memory.usage_in_bytes
> memory.limit_in_bytes  memory.move_charge_at_immigrate memory.use_hierarchy
>
> these are the permissions on /cgroup/ducc
>
> drwxr-xr-x 2 ducc root 0 Oct  5 09:29 .
>
>
> Thanks and Regards,
> Satya Nand Kanodia
>
> On 10/01/2015 07:49 PM, Eddie Epstein wrote:
>
>> FYI, below are the contents of /groups on a SLES 11.2 machine:
>>
>> ~$ ls /cgroup
>> cgroup.clone_children  cpu.rt_runtime_us   memory.limit_in_bytes
>> memory.statsysdefault
>> cgroup.event_control   cpu.shares  memory.max_usage_in_bytes
>> memory.swappiness  tasks
>> cgroup.procs   cpu.statmemory.move_charge_at_immigrate
>> memory.usage_in_bytes
>> cpu.cfs_period_us  duccmemory.numa_stat
>> memory.use_hierarchy
>> cpu.cfs_quota_us   memory.failcnt  memory.oom_control
>> notify_on_release
>> cpu.rt_period_us   memory.force_empty  memory.soft_limit_in_bytes
>> release_agent
>>
>> ~$ ls -ld /cgroup/ducc/
>> drwxr-xr-x 2 ducc root 0 Sep  5 11:31 /cgroup/ducc/
>>
>>
>> On Thu, Oct 1, 2015 at 8:20 AM, Eddie Epstein 
>> wrote:
>>
>> Well, please list the contents of /cgroups to confirm that the custom
>>> cgconfig.conf is operating.
>>> Eddie
>>>
>>> On Thu, Oct 1, 2015 at 12:40 AM, Satya Nand Kanodia <
>>> satya.kano...@orkash.com> wrote:
>>>
>>> Hi Eddie,
>>>>
>>>> I had copied the same contents in  cgconfig.conf.(as it was also written
>>>> in documentation.)
>>>>
>>>> anything else ?
>>>>
>>>> Thanks and Regards,
>>>> Satya Nand Kanodia
>>>>
>>>> On 09/30/2015 05:28 PM, Eddie Epstein wrote:
>>>>
>>>> Hi Satya,
>>>>>
>>>>> There is a custom cgconfig.conf that has to be installed in /etc/
>>>>> before
>>>>> starting the cgconfig service. Please see step 2 in the section
>>>>> "CGroups
>>>>> Installation and Configuration". The custom config is repeated below.
>>>>> Regards, Eddie
>>>>>
>>>>>  # Mount cgroups
>>>>>  mount {
>>>>> memory = /cgroup;
>>>>> cpu = /cgroup;
>>>>>  }
>>>>>  # Define cgroup ducc and setup permissions
>>>>>  group ducc {
>>>>>   perm {
>>>>>   task {
>>>>>      uid = ducc;
>>>>>   }
>>>>>   admin {
>>>>>  uid = ducc;
>>>>>   }
>>>>>   }
>>>>>   memory {}
>>>>>   cpu{}
>>>>>  }
>>>>>
>>>>> On Wed, Sep 30, 2015 at 12:43 AM, Satya Nand Kanodia <
>>>>> satya.kano...@orkash.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>>> DUCC is running with the ducc user.
>>>>>> why did you say "DUCC's /etc/cgconfig.conf specifies user=ducc to
>>>>>> create
>>>>>> cgroups."? As I installed C-Groups with sudo yum, It has root as
>>>>>> owner.
>>>>>> Do
>>>>>> I need to change it's owner or permissions. It is having currently 644
>>>>>> permissions.
>>>>>>
>>>>>> Thanks and Regards,
>>>>>> Satya Nand Kanodia
>>>>>>
>>>>>> On 09/29/2015 06:46 PM, Eddie Epstein wrote:
>>>>>>
>>>>>> DUCC's /etc/cgconfig.conf specifies user=ducc to create cgroups.
>>>>>>
>>>>>>> Is DUCC running as user=ducc?
>>>>>>>
>>>>>>> Using sudo for cgreate testing suggests that the ducc userid is not
>>>>>>> being
>>>>>>> used.
>>>>>>>
>>>>>>> Eddie
>>>>>>>
>>>>>>> On Tue, Sep 29, 2015 at 3:12 AM, Satya Nand Kanodia <
>>>>>>> satya.kano...@orkash.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am using CentOS release 6.6 for DUCC installation. I did all
>>>>>>>> according
>>>>>>>> to documentation to enable C-Groups.
>>>>>>>> Following command executed without any error.( I had to execute it
>>>>>>>> using
>>>>>>>> sudo.)
>>>>>>>>
>>>>>>>> cgcreate -t ducc -a ducc -g memory:ducc/test-cgroups
>>>>>>>>
>>>>>>>> But on webserver in machines section , it is showing *off* status
>>>>>>>> under
>>>>>>>> the C-Groups.
>>>>>>>>
>>>>>>>> I don't know what went wrong.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks and Regards,
>>>>>>>> Satya Nand Kanodia
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>


Re: C-Groups status remains off in web server after installing C-Groups

2015-10-01 Thread Eddie Epstein
FYI, below are the contents of /cgroup on a SLES 11.2 machine:

~$ ls /cgroup
cgroup.clone_children  cpu.rt_runtime_us   memory.limit_in_bytes
memory.statsysdefault
cgroup.event_control   cpu.shares  memory.max_usage_in_bytes
memory.swappiness  tasks
cgroup.procs   cpu.statmemory.move_charge_at_immigrate
memory.usage_in_bytes
cpu.cfs_period_us  duccmemory.numa_stat
memory.use_hierarchy
cpu.cfs_quota_us   memory.failcnt  memory.oom_control
notify_on_release
cpu.rt_period_us   memory.force_empty  memory.soft_limit_in_bytes
release_agent

~$ ls -ld /cgroup/ducc/
drwxr-xr-x 2 ducc root 0 Sep  5 11:31 /cgroup/ducc/


On Thu, Oct 1, 2015 at 8:20 AM, Eddie Epstein  wrote:

> Well, please list the contents of /cgroups to confirm that the custom
> cgconfig.conf is operating.
> Eddie
>
> On Thu, Oct 1, 2015 at 12:40 AM, Satya Nand Kanodia <
> satya.kano...@orkash.com> wrote:
>
>> Hi Eddie,
>>
>> I had copied the same contents in  cgconfig.conf.(as it was also written
>> in documentation.)
>>
>> anything else ?
>>
>> Thanks and Regards,
>> Satya Nand Kanodia
>>
>> On 09/30/2015 05:28 PM, Eddie Epstein wrote:
>>
>>> Hi Satya,
>>>
>>> There is a custom cgconfig.conf that has to be installed in /etc/ before
>>> starting the cgconfig service. Please see step 2 in the section "CGroups
>>> Installation and Configuration". The custom config is repeated below.
>>> Regards, Eddie
>>>
>>> # Mount cgroups
>>> mount {
>>>memory = /cgroup;
>>>cpu = /cgroup;
>>> }
>>> # Define cgroup ducc and setup permissions
>>> group ducc {
>>>  perm {
>>>  task {
>>> uid = ducc;
>>>  }
>>>  admin {
>>> uid = ducc;
>>>  }
>>>  }
>>>  memory {}
>>>  cpu{}
>>> }
>>>
>>> On Wed, Sep 30, 2015 at 12:43 AM, Satya Nand Kanodia <
>>> satya.kano...@orkash.com> wrote:
>>>
>>> Hi,
>>>>
>>>> DUCC is running with the ducc user.
>>>> why did you say "DUCC's /etc/cgconfig.conf specifies user=ducc to create
>>>> cgroups."? As I installed C-Groups with sudo yum, It has root as owner.
>>>> Do
>>>> I need to change it's owner or permissions. It is having currently 644
>>>> permissions.
>>>>
>>>> Thanks and Regards,
>>>> Satya Nand Kanodia
>>>>
>>>> On 09/29/2015 06:46 PM, Eddie Epstein wrote:
>>>>
>>>> DUCC's /etc/cgconfig.conf specifies user=ducc to create cgroups.
>>>>> Is DUCC running as user=ducc?
>>>>>
>>>>> Using sudo for cgreate testing suggests that the ducc userid is not
>>>>> being
>>>>> used.
>>>>>
>>>>> Eddie
>>>>>
>>>>> On Tue, Sep 29, 2015 at 3:12 AM, Satya Nand Kanodia <
>>>>> satya.kano...@orkash.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>>> I am using CentOS release 6.6 for DUCC installation. I did all
>>>>>> according
>>>>>> to documentation to enable C-Groups.
>>>>>> Following command executed without any error.( I had to execute it
>>>>>> using
>>>>>> sudo.)
>>>>>>
>>>>>> cgcreate -t ducc -a ducc -g memory:ducc/test-cgroups
>>>>>>
>>>>>> But on webserver in machines section , it is showing *off* status
>>>>>> under
>>>>>> the C-Groups.
>>>>>>
>>>>>> I don't know what went wrong.
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks and Regards,
>>>>>> Satya Nand Kanodia
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>
>


Re: C-Groups status remains off in web server after installing C-Groups

2015-10-01 Thread Eddie Epstein
Well, please list the contents of /cgroup to confirm that the custom
cgconfig.conf is operating.
Eddie

On Thu, Oct 1, 2015 at 12:40 AM, Satya Nand Kanodia <
satya.kano...@orkash.com> wrote:

> Hi Eddie,
>
> I had copied the same contents in  cgconfig.conf.(as it was also written
> in documentation.)
>
> anything else ?
>
> Thanks and Regards,
> Satya Nand Kanodia
>
> On 09/30/2015 05:28 PM, Eddie Epstein wrote:
>
>> Hi Satya,
>>
>> There is a custom cgconfig.conf that has to be installed in /etc/ before
>> starting the cgconfig service. Please see step 2 in the section "CGroups
>> Installation and Configuration". The custom config is repeated below.
>> Regards, Eddie
>>
>> # Mount cgroups
>> mount {
>>memory = /cgroup;
>>cpu = /cgroup;
>> }
>> # Define cgroup ducc and setup permissions
>> group ducc {
>>  perm {
>>  task {
>> uid = ducc;
>>  }
>>  admin {
>> uid = ducc;
>>  }
>>  }
>>  memory {}
>>  cpu{}
>> }
>>
>> On Wed, Sep 30, 2015 at 12:43 AM, Satya Nand Kanodia <
>> satya.kano...@orkash.com> wrote:
>>
>> Hi,
>>>
>>> DUCC is running with the ducc user.
>>> why did you say "DUCC's /etc/cgconfig.conf specifies user=ducc to create
>>> cgroups."? As I installed C-Groups with sudo yum, It has root as owner.
>>> Do
>>> I need to change it's owner or permissions. It is having currently 644
>>> permissions.
>>>
>>> Thanks and Regards,
>>> Satya Nand Kanodia
>>>
>>> On 09/29/2015 06:46 PM, Eddie Epstein wrote:
>>>
>>> DUCC's /etc/cgconfig.conf specifies user=ducc to create cgroups.
>>>> Is DUCC running as user=ducc?
>>>>
>>>> Using sudo for cgreate testing suggests that the ducc userid is not
>>>> being
>>>> used.
>>>>
>>>> Eddie
>>>>
>>>> On Tue, Sep 29, 2015 at 3:12 AM, Satya Nand Kanodia <
>>>> satya.kano...@orkash.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>>> I am using CentOS release 6.6 for DUCC installation. I did all
>>>>> according
>>>>> to documentation to enable C-Groups.
>>>>> Following command executed without any error.( I had to execute it
>>>>> using
>>>>> sudo.)
>>>>>
>>>>> cgcreate -t ducc -a ducc -g memory:ducc/test-cgroups
>>>>>
>>>>> But on webserver in machines section , it is showing *off* status under
>>>>> the C-Groups.
>>>>>
>>>>> I don't know what went wrong.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks and Regards,
>>>>> Satya Nand Kanodia
>>>>>
>>>>>
>>>>>
>>>>>
>


Re: C-Groups status remains off in web server after installing C-Groups

2015-09-30 Thread Eddie Epstein
Hi Satya,

There is a custom cgconfig.conf that has to be installed in /etc/ before
starting the cgconfig service. Please see step 2 in the section "CGroups
Installation and Configuration". The custom config is repeated below.
Regards, Eddie

   # Mount cgroups
   mount {
  memory = /cgroup;
  cpu = /cgroup;
   }
   # Define cgroup ducc and setup permissions
   group ducc {
perm {
task {
   uid = ducc;
}
admin {
   uid = ducc;
}
}
memory {}
cpu{}
   }

On Wed, Sep 30, 2015 at 12:43 AM, Satya Nand Kanodia <
satya.kano...@orkash.com> wrote:

> Hi,
>
> DUCC is running with the ducc user.
> why did you say "DUCC's /etc/cgconfig.conf specifies user=ducc to create
> cgroups."? As I installed C-Groups with sudo yum, It has root as owner. Do
> I need to change it's owner or permissions. It is having currently 644
> permissions.
>
> Thanks and Regards,
> Satya Nand Kanodia
>
> On 09/29/2015 06:46 PM, Eddie Epstein wrote:
>
>> DUCC's /etc/cgconfig.conf specifies user=ducc to create cgroups.
>> Is DUCC running as user=ducc?
>>
>> Using sudo for cgreate testing suggests that the ducc userid is not being
>> used.
>>
>> Eddie
>>
>> On Tue, Sep 29, 2015 at 3:12 AM, Satya Nand Kanodia <
>> satya.kano...@orkash.com> wrote:
>>
>> Hi,
>>>
>>> I am using CentOS release 6.6 for DUCC installation. I did all according
>>> to documentation to enable C-Groups.
>>> Following command executed without any error.( I had to execute it using
>>> sudo.)
>>>
>>> cgcreate -t ducc -a ducc -g memory:ducc/test-cgroups
>>>
>>> But on webserver in machines section , it is showing *off* status under
>>> the C-Groups.
>>>
>>> I don't know what went wrong.
>>>
>>>
>>>
>>> --
>>> Thanks and Regards,
>>> Satya Nand Kanodia
>>>
>>>
>>>
>


Re: C-Groups status remains off in web server after installing C-Groups

2015-09-29 Thread Eddie Epstein
DUCC's /etc/cgconfig.conf specifies user=ducc to create cgroups.
Is DUCC running as user=ducc?

Using sudo for cgcreate testing suggests that the ducc userid is not being
used.

Eddie

On Tue, Sep 29, 2015 at 3:12 AM, Satya Nand Kanodia <
satya.kano...@orkash.com> wrote:

> Hi,
>
> I am using CentOS release 6.6 for DUCC installation. I did all according
> to documentation to enable C-Groups.
> Following command executed without any error.( I had to execute it using
> sudo.)
>
> cgcreate -t ducc -a ducc -g memory:ducc/test-cgroups
>
> But on webserver in machines section , it is showing *off* status under
> the C-Groups.
>
> I don't know what went wrong.
>
>
>
> --
> Thanks and Regards,
> Satya Nand Kanodia
>
>


Re: Error when trying to drop CAS with FlowController

2015-09-07 Thread Eddie Epstein
One way to allow a delegate to terminate subsequent flow of a primary CAS
would be for the built-in UIMA flow controller to assign FinalStep() to CASes
containing some "drop-cas-mark".

Assuming your OUTER_AAE is not using a custom flow controller, this would allow
the OUTER_AAE to respect a mark set by the delegate INNER_AAE without changing
any code in the OUTER.

The next problem would be establishing a convention for the drop-cas-mark.
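As a very rough sketch of what such a flow controller could look like (the mark
type org.example.DropCasMark is made up, and the delegate ordering here is simply
the context map's key order rather than a declared fixed flow):

import java.util.Iterator;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.Type;
import org.apache.uima.flow.CasFlowController_ImplBase;
import org.apache.uima.flow.CasFlow_ImplBase;
import org.apache.uima.flow.FinalStep;
import org.apache.uima.flow.Flow;
import org.apache.uima.flow.SimpleStep;
import org.apache.uima.flow.Step;

public class DropMarkFlowController extends CasFlowController_ImplBase {

  @Override
  public Flow computeFlow(CAS cas) {
    return new DropMarkFlow();
  }

  class DropMarkFlow extends CasFlow_ImplBase {
    // visit delegates in the order the context reports them (simplified)
    private final Iterator<String> keys =
        getContext().getAnalysisEngineMetaDataMap().keySet().iterator();

    @Override
    public Step next() {
      CAS cas = getCas();
      Type mark = cas.getTypeSystem().getType("org.example.DropCasMark");
      // a CAS carrying the mark goes straight to FinalStep
      if (mark != null
          && cas.getIndexRepository().getAllIndexedFS(mark).hasNext()) {
        return new FinalStep();
      }
      if (keys.hasNext()) {
        return new SimpleStep(keys.next());
      }
      return new FinalStep();
    }
  }
}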


On Mon, Sep 7, 2015 at 12:34 PM, Richard Eckart de Castilho 
wrote:

> I don't think that cas.deleteView() would be a clean solution unless UIMA
> would be default drop any CAS that has its only remaining view removed.
>
> Dropping the whole unit-of-work (the CAS) instead of stripping its content
> appear to me a cleaner solution.
>
> -- Richard
>
> On 07.09.2015, at 17:45, Eddie Epstein  wrote:
>
> > There was a Jira opened 7 years ago to support a cas.deleteView() method,
> > but it has been ignored due to lack of interest.
> > See https://issues.apache.org/jira/browse/UIMA-830
> >
> > Eddie
> >
> > On Mon, Sep 7, 2015 at 11:20 AM, Zesch, Torsten <
> torsten.ze...@uni-due.de>
> > wrote:
> >
> >> Only if it could completely empty the CAS including the document text,
> but
> >> as far as I know the document text cannot be changed once it is set.
> >>
> >> Am 07/09/15 17:14 schrieb "Eddie Epstein" unter :
> >>
> >>> Can the filter in the INNER_AAE modify such CASes, perhaps
> >>> by deleting data, that would result in the existing consumer
> >>> effectively ignoring them?
> >>>
> >>> On Mon, Sep 7, 2015 at 11:08 AM, Zesch, Torsten <
> torsten.ze...@uni-due.de
> >>>
> >>> wrote:
> >>>
> >>>>> The consumer does not have to be modified if the flow controller
> >>>>> drops CASes marked to be ignored.
> >>>>>
> >>>>> Sounds like the issue in this case is that the consumer is in the
> >>>>> OUTER_AAE, and there is a desire not to have any components
> >>>>> in the OUTER_AAE be aware of the filtering operation.
> >>>>> Is this right?
> >>>>
> >>>> Yes exactly.
> >>>>
> >>>> -Torsten
> >>>>
> >>>>
> >>
> >>
>
>


Re: Error when trying to drop CAS with FlowController

2015-09-07 Thread Eddie Epstein
There was a Jira opened 7 years ago to support a cas.deleteView() method,
but it has been ignored due to lack of interest.
See https://issues.apache.org/jira/browse/UIMA-830

Eddie

On Mon, Sep 7, 2015 at 11:20 AM, Zesch, Torsten 
wrote:

> Only if it could completely empty the CAS including the document text, but
> as far as I know the document text cannot be changed once it is set.
>
> Am 07/09/15 17:14 schrieb "Eddie Epstein" unter :
>
> >Can the filter in the INNER_AAE modify such CASes, perhaps
> >by deleting data, that would result in the existing consumer
> >effectively ignoring them?
> >
> >On Mon, Sep 7, 2015 at 11:08 AM, Zesch, Torsten  >
> >wrote:
> >
> >> >The consumer does not have to be modified if the flow controller
> >> >drops CASes marked to be ignored.
> >> >
> >> >Sounds like the issue in this case is that the consumer is in the
> >> >OUTER_AAE, and there is a desire not to have any components
> >> >in the OUTER_AAE be aware of the filtering operation.
> >> >Is this right?
> >>
> >> Yes exactly.
> >>
> >> -Torsten
> >>
> >>
>
>


Re: CAS merger/multiplier N:M mapping

2015-09-07 Thread Eddie Epstein
Petr,


> > >   (I'm somewhat tempted to cut my losses short (much too late) and
> > > abandon UIMA flow control altogether, using only simple pipelines and
> > > having custom glue code to connect these together, as it seems like
> > > getting the flow to work in interesting cases is a huge time sink and
> in
> > > retrospect, it could never pay off any abstract advantage of easier
> > > distributed processing (where you probably end up having to chop up the
> > > pipeline manually anyway).  I would probably never recommend new UIMA
> > > users to strive for a single pipeline with CAS multipliers/mergers and
> > > begin to consider these features an evolutionary dead end rather than
> > > advantageous.  Not sure if there even *are* any other real users using
> > > advanced flows besides me and DeepQA.  I'll be glad to hear any
> opinions
> > > on this!)
> > >
> > >
> > Definitely the advantage to encapsulating analytics in standard UIMA
> > components is easy scalability via the vertical and horizontal scale out
> > options offered by UIMA-AS and DUCC. Flexibility in chopping up a
> > pipeline into services as needed is another advantage.
>
>   But as far as I understand, you need to explicitly define and deploy
> AEs that are to be run on different machines anyway.  So I'm not sure if
> the extra value is really that large in the end?
>
>
Well, yes. But with DUCC only the definition needs to be done explicitly;
the deployment and replicated scale-out of all components are done
automatically.

Eddie


Re: Error when trying to drop CAS with FlowController

2015-09-07 Thread Eddie Epstein
Can the filter in the INNER_AAE modify such CASes, perhaps
by deleting data, that would result in the existing consumer
effectively ignoring them?

On Mon, Sep 7, 2015 at 11:08 AM, Zesch, Torsten 
wrote:

> >The consumer does not have to be modified if the flow controller
> >drops CASes marked to be ignored.
> >
> >Sounds like the issue in this case is that the consumer is in the
> >OUTER_AAE, and there is a desire not to have any components
> >in the OUTER_AAE be aware of the filtering operation.
> >Is this right?
>
> Yes exactly.
>
> -Torsten
>
>


Re: Error when trying to drop CAS with FlowController

2015-09-07 Thread Eddie Epstein
The consumer does not have to be modified if the flow controller
drops CASes marked to be ignored.

Sounds like the issue in this case is that the consumer is in the
OUTER_AAE, and there is a desire not to have any components
in the OUTER_AAE be aware of the filtering operation.
Is this right?

Eddie


On Sun, Sep 6, 2015 at 3:33 PM, Zesch, Torsten 
wrote:

> Thanks for your input.
>
> To give some more information about our use case:
> Our input is a mix of documents.
> Only some of them are relevant and should be written by the consumer.
> We also thought about the solution with a special FeatureStructure, but
> this has the disadvantage that the consumer needs to be aware of that.
> It would be easier if some CASes could simply be dropped.
> I guess this could even be useful for flat workflows.
>
> -Torsten
>
>
> Am 06/09/15 17:31 schrieb "Eddie Epstein" unter :
>
> >Keeping the filter inside the INNER may still be useful to
> >terminate any further processing in that AAE.
> >
> >outputsNewCases=true is just saying that an aggregate is
> >a CasMultiplier and *might* return child-CASes. It doesn't
> >change the CAS-in/CAS-out contract for the component.
> >
> >I think a fair amount of logic would have to be reworked
> >if that contract were changed. For sure in UIMA-AS,
> >where supporting CM services is one of the more complex
> >design issues. But maybe it would be interesting to see
> >the pros vs cons of making that change.
> >
> >Eddie
> >
> >
> >On Sun, Sep 6, 2015 at 11:20 AM, Richard Eckart de Castilho
> >
> >wrote:
> >
> >> That would require that the OUTER_AAE is aware of the filtering.
> >> We would prefer if all customization/filtering/etc. could be done in the
> >> INNER_AAE which is the declared extension point.
> >>
> >> In the worst case, we'd probably opt to move the FILTER from to
> >> the OUTER_AAE entirely and make filtering a default option.
> >>
> >> My assumption would be that the OUTER_AAE should not have a problem
> >> if the INNER_AAE drops anything if INNER_AAE declares
> >>outputsNewCases=true.
> >> But obviously that assumption is wrong - I/we just don't get why.
> >>
> >> Cheers,
> >>
> >> -- Richard
> >>
> >> On 06.09.2015, at 17:14, Eddie Epstein  wrote:
> >>
> >> > How about the filter adds a FeatureStructure indicating that the CAS
> >> should
> >> > be dropped.
> >> > Then when the INNER_AAE returns the CAS, the flow controller in the
> >> > OUTER_AAE
> >> > sends the CAS to FinalStep?
> >> >
> >> > Eddie
> >> >
> >> > On Sun, Sep 6, 2015 at 11:08 AM, Richard Eckart de Castilho <
> >> r...@apache.org>
> >> > wrote:
> >> >
> >> >> Eddie,
> >> >>
> >> >> we (Torsten and I) have the case that a reader produces a number of
> >> CASes
> >> >> and we want to filter out some of them because they do not match a
> >>given
> >> >> criteria.
> >> >>
> >> >> The pipeline/flow structure we are using looks as follows:
> >> >>
> >> >> READER -> OUTER_AAE { AEs..., INNER_AAE { FILTER }, AEs..., CONSUMER
> >>}
> >> >>
> >> >> READER, OUTER_AAE, AEs and CONSUMER are assumed to be fixed.
> >> >>
> >> >> INNER_AAE is meant to be an extension point and the FILTER inside it
> >> >> is meant to remove all CASes that do not match our criteria such
> >> >> that those do not reach the CONSUMER.
> >> >>
> >> >> So we do explicitly not want certain CASes to continue the processing
> >> path.
> >> >>
> >> >> -- Richard
> >> >>
> >> >> On 06.09.2015, at 17:04, Eddie Epstein  wrote:
> >> >>
> >> >>> Richard,
> >> >>>
> >> >>> In general the input CAS must continue down some processing path.
> >> >>> Where is it stored and what triggers its continued processing if it
> >>is
> >> >> not
> >> >>> returned?
> >> >>>
> >> >>> Eddie
> >> >>>
> >> >>> On Sun, Sep 6, 2015 at 10:28 AM, Richard Eckart de Castilho <
> >> >> r...@apache.org>
> >> >>> wrote:
> >> >>>
> >> >>>> Hi E

Re: Error when trying to drop CAS with FlowController

2015-09-06 Thread Eddie Epstein
Keeping the filter inside the INNER may still be useful to
terminate any further processing in that AAE.

outputsNewCases=true is just saying that an aggregate is
a CasMultiplier and *might* return child-CASes. It doesn't
change the CAS-in/CAS-out contract for the component.

I think a fair amount of logic would have to be reworked
if that contract were changed. For sure in UIMA-AS,
where supporting CM services is one of the more complex
design issues. But maybe it would be interesting to see
the pros vs cons of making that change.

Eddie


On Sun, Sep 6, 2015 at 11:20 AM, Richard Eckart de Castilho 
wrote:

> That would require that the OUTER_AAE is aware of the filtering.
> We would prefer if all customization/filtering/etc. could be done in the
> INNER_AAE which is the declared extension point.
>
> In the worst case, we'd probably opt to move the FILTER from to
> the OUTER_AAE entirely and make filtering a default option.
>
> My assumption would be that the OUTER_AAE should not have a problem
> if the INNER_AAE drops anything if INNER_AAE declares outputsNewCases=true.
> But obviously that assumption is wrong - I/we just don't get why.
>
> Cheers,
>
> -- Richard
>
> On 06.09.2015, at 17:14, Eddie Epstein  wrote:
>
> > How about the filter adds a FeatureStructure indicating that the CAS
> should
> > be dropped.
> > Then when the INNER_AAE returns the CAS, the flow controller in the
> > OUTER_AAE
> > sends the CAS to FinalStep?
> >
> > Eddie
> >
> > On Sun, Sep 6, 2015 at 11:08 AM, Richard Eckart de Castilho <
> r...@apache.org>
> > wrote:
> >
> >> Eddie,
> >>
> >> we (Torsten and I) have the case that a reader produces a number of
> CASes
> >> and we want to filter out some of them because they do not match a given
> >> criteria.
> >>
> >> The pipeline/flow structure we are using looks as follows:
> >>
> >> READER -> OUTER_AAE { AEs..., INNER_AAE { FILTER }, AEs..., CONSUMER }
> >>
> >> READER, OUTER_AAE, AEs and CONSUMER are assumed to be fixed.
> >>
> >> INNER_AAE is meant to be an extension point and the FILTER inside it
> >> is meant to remove all CASes that do not match our criteria such
> >> that those do not reach the CONSUMER.
> >>
> >> So we do explicitly not want certain CASes to continue the processing
> path.
> >>
> >> -- Richard
> >>
> >> On 06.09.2015, at 17:04, Eddie Epstein  wrote:
> >>
> >>> Richard,
> >>>
> >>> In general the input CAS must continue down some processing path.
> >>> Where is it stored and what triggers its continued processing if it is
> >> not
> >>> returned?
> >>>
> >>> Eddie
> >>>
> >>> On Sun, Sep 6, 2015 at 10:28 AM, Richard Eckart de Castilho <
> >> r...@apache.org>
> >>> wrote:
> >>>
> >>>> Hi Eddie,
> >>>>
> >>>> in most cases, we use process(CAS) and in such a case what you
> describe
> >>>> is very logical.
> >>>>
> >>>> However, when setting outputsNewCases to true, doesn't the contract
> >> change?
> >>>> My understanding is that processAndOutputNewCASes(CAS) is being
> >>>> used and in such a case. Why shouldn't it be ok that the iterator
> >>>> returned by processAndOutputNewCASes does not contain the input CAS?
> >>>>
> >>>> Cheers,
> >>>>
> >>>> -- Richard
> >>>>
> >>>> On 06.09.2015, at 16:21, Eddie Epstein  wrote:
> >>>>
> >>>>> Hi Richard,
> >>>>>
> >>>>> FinalStep() in a CasMultiplier aggregate means to stop further flow
> >>>>> in the aggregate and return the CAS to the component that passed
> >>>>> the CAS into the aggregate, or if a child-CAS, passed the child's
> >>>>> parent-CAS into the aggregate.
> >>>>>
> >>>>> FinalStep(true) is used to stop a child-CAS from being returned
> >>>>> to the component. But the contract for an AE is CAS-in/CAS-out,
> >>>>> which means a CAS coming into an AE must be returned.
> >>>>>
> >>>>> Eddie
> >>>>>
> >>>>> On Sun, Sep 6, 2015 at 9:59 AM, Richard Eckart de Castilho <
> >>>> r...@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Eddie,
> >>>>>>
> >>>>>> ok, but why can input CASes created outside the aggregate not be
> >>>> dropped?
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> -- Richard
> >>>>
> >>>>
> >>
> >>
>
>


Re: Error when trying to drop CAS with FlowController

2015-09-06 Thread Eddie Epstein
How about the filter adds a FeatureStructure indicating that the CAS should
be dropped.
Then when the INNER_AAE returns the CAS, the flow controller in the
OUTER_AAE
sends the CAS to FinalStep?
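
For illustration, a rough sketch of that flow controller logic (the class
name, marker type and delegate keys are invented, not from an existing
project):

    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.Type;
    import org.apache.uima.flow.CasFlowController_ImplBase;
    import org.apache.uima.flow.CasFlow_ImplBase;
    import org.apache.uima.flow.FinalStep;
    import org.apache.uima.flow.Flow;
    import org.apache.uima.flow.SimpleStep;
    import org.apache.uima.flow.Step;

    public class SkipMarkedCasFlowController extends CasFlowController_ImplBase {

      // Delegate keys as declared in the outer aggregate; illustrative only.
      private static final String[] SEQUENCE = { "InnerAAE", "OtherAE", "Consumer" };

      @Override
      public Flow computeFlow(CAS cas) throws AnalysisEngineProcessException {
        return new SkipMarkedCasFlow();
      }

      class SkipMarkedCasFlow extends CasFlow_ImplBase {
        private int next = 0;

        @Override
        public Step next() throws AnalysisEngineProcessException {
          CAS cas = getCas();
          // "example.DropMarker" is a made-up annotation type the filter would add.
          Type marker = cas.getTypeSystem().getType("example.DropMarker");
          boolean marked = marker != null && cas.getAnnotationIndex(marker).size() > 0;
          if (marked || next >= SEQUENCE.length) {
            // Plain FinalStep(): the CAS leaves this aggregate without visiting
            // the remaining delegates (e.g. the consumer). FinalStep(true) would
            // try to drop it, which is only legal for CASes created inside the
            // same aggregate as this flow controller.
            return new FinalStep();
          }
          return new SimpleStep(SEQUENCE[next++]);
        }
      }
    }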

Eddie

On Sun, Sep 6, 2015 at 11:08 AM, Richard Eckart de Castilho 
wrote:

> Eddie,
>
> we (Torsten and I) have the case that a reader produces a number of CASes
> and we want to filter out some of them because they do not match a given
> criteria.
>
> The pipeline/flow structure we are using looks as follows:
>
> READER -> OUTER_AAE { AEs..., INNER_AAE { FILTER }, AEs..., CONSUMER }
>
> READER, OUTER_AAE, AEs and CONSUMER are assumed to be fixed.
>
> INNER_AAE is meant to be an extension point and the FILTER inside it
> is meant to remove all CASes that do not match our criteria such
> that those do not reach the CONSUMER.
>
> So we do explicitly not want certain CASes to continue the processing path.
>
> -- Richard
>
> On 06.09.2015, at 17:04, Eddie Epstein  wrote:
>
> > Richard,
> >
> > In general the input CAS must continue down some processing path.
> > Where is it stored and what triggers its continued processing if it is
> not
> > returned?
> >
> > Eddie
> >
> > On Sun, Sep 6, 2015 at 10:28 AM, Richard Eckart de Castilho <
> r...@apache.org>
> > wrote:
> >
> >> Hi Eddie,
> >>
> >> in most cases, we use process(CAS) and in such a case what you describe
> >> is very logical.
> >>
> >> However, when setting outputsNewCases to true, doesn't the contract
> change?
> >> My understanding is that processAndOutputNewCASes(CAS) is being
> >> used and in such a case. Why shouldn't it be ok that the iterator
> >> returned by processAndOutputNewCASes does not contain the input CAS?
> >>
> >> Cheers,
> >>
> >> -- Richard
> >>
> >> On 06.09.2015, at 16:21, Eddie Epstein  wrote:
> >>
> >>> Hi Richard,
> >>>
> >>> FinalStep() in a CasMultiplier aggregate means to stop further flow
> >>> in the aggregate and return the CAS to the component that passed
> >>> the CAS into the aggregate, or if a child-CAS, passed the child's
> >>> parent-CAS into the aggregate.
> >>>
> >>> FinalStep(true) is used to stop a child-CAS from being returned
> >>> to the component. But the contract for an AE is CAS-in/CAS-out,
> >>> which means a CAS coming into an AE must be returned.
> >>>
> >>> Eddie
> >>>
> >>> On Sun, Sep 6, 2015 at 9:59 AM, Richard Eckart de Castilho <
> >> r...@apache.org>
> >>> wrote:
> >>>
> >>>> Hi Eddie,
> >>>>
> >>>> ok, but why can input CASes created outside the aggregate not be
> >> dropped?
> >>>>
> >>>> Cheers,
> >>>>
> >>>> -- Richard
> >>
> >>
>
>


Re: Error when trying to drop CAS with FlowController

2015-09-06 Thread Eddie Epstein
Richard,

In general the input CAS must continue down some processing path.
Where is it stored and what triggers its continued processing if it is not
returned?

Eddie

On Sun, Sep 6, 2015 at 10:28 AM, Richard Eckart de Castilho 
wrote:

> Hi Eddie,
>
> in most cases, we use process(CAS) and in such a case what you describe
> is very logical.
>
> However, when setting outputsNewCases to true, doesn't the contract change?
> My understanding is that processAndOutputNewCASes(CAS) is being
> used and in such a case. Why shouldn't it be ok that the iterator
> returned by processAndOutputNewCASes does not contain the input CAS?
>
> Cheers,
>
> -- Richard
>
> On 06.09.2015, at 16:21, Eddie Epstein  wrote:
>
> > Hi Richard,
> >
> > FinalStep() in a CasMultiplier aggregate means to stop further flow
> > in the aggregate and return the CAS to the component that passed
> > the CAS into the aggregate, or if a child-CAS, passed the child's
> > parent-CAS into the aggregate.
> >
> > FinalStep(true) is used to stop a child-CAS from being returned
> > to the component. But the contract for an AE is CAS-in/CAS-out,
> > which means a CAS coming into an AE must be returned.
> >
> > Eddie
> >
> > On Sun, Sep 6, 2015 at 9:59 AM, Richard Eckart de Castilho <
> r...@apache.org>
> > wrote:
> >
> >> Hi Eddie,
> >>
> >> ok, but why can input CASes created outside the aggregate not be
> dropped?
> >>
> >> Cheers,
> >>
> >> -- Richard
>
>


Re: CAS merger/multiplier N:M mapping

2015-09-06 Thread Eddie Epstein
Hi Petr

On Sun, Sep 6, 2015 at 10:11 AM, Petr Baudis  wrote:

>   Hi!
>
>   I'm currently struggling to perform a complex flow transformation with
> UIMA.  I have multiple (N) CASes with some fulltext search results.
> I chop these search results to sentences and would like to pick the top
> M sentences from the search results collected and build CASes from them
> to do further analysis.  So, I'd like to copy subsets (document text
> wise and annotation wise) of N input CASes to M output CASes.  I don't
> know how to do this technically.  I tried two non-workable ideas so far:
>
>   (i) Keep around references to the respective views of input CASes
> and use them as CasCopier sources when the time comes to produce
> the new CASes.  Turns out the input CASes are (unsurprisingly) recycled
> and the references I kept around at process() time aren't valid when
> next() is called much later.
>
>   (ii) Use an internal "intermediary" CAS instance in process() to which
> I append my sentences, then use it as a source of output CASes.  Turns
> out (surprisingly) that I can't append to a sofa documenttext ("Data for
> Sofa feature setLocalSofaData() has already been set." - not sure about
> the reason for this restriction).
>

The Sofa data for a view is immutable, otherwise existing annotations
could become invalid.


>
>   I think the only choice except downright unmaintainable hacks (like
> programatically generated M views) is to just give up on preserving my
> annotations and carry over just the sentence texts.  Am I missing
> something?
>

Creating a new view in the intermediate CAS for each of the N input CASes
would work. A new output CAS Sofa would be composed of data from
multiple views, with the annotation offsets adjusted as they are added to
the new output CAS.

One problem there is that the intermediate CAS would continue to grow
in size, so there would need to be some point when it could be reset.
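
As a rough sketch of that idea (class and view names are invented; the
intermediate CAS is assumed to be application-created, e.g. via
CasCreationUtils, so that it can be reset):

    import org.apache.uima.cas.CAS;
    import org.apache.uima.util.CasCopier;

    // Collect material from N input CASes into per-input views of an
    // intermediate CAS that a CasMultiplier keeps across process() calls.
    public class SentenceCollector {

      private final CAS intermediate;
      private int inputCount = 0;

      public SentenceCollector(CAS intermediate) {
        this.intermediate = intermediate;
      }

      // Called from the CasMultiplier's process(): copy the view of interest
      // from the incoming CAS into a new view of the intermediate CAS, Sofa
      // included, so text and annotations survive after the input CAS is
      // recycled by the framework.
      public void add(CAS inputCas) {
        CAS target = intermediate.createView("input-" + inputCount++);
        CasCopier copier = new CasCopier(inputCas, intermediate);
        copier.copyCasView(inputCas.getView(CAS.NAME_DEFAULT_SOFA), target, true);
      }

      // Once output CASes have been assembled from these views (with
      // annotation offsets adjusted), reset so the CAS does not grow forever.
      public void clear() {
        intermediate.reset();
      }
    }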


>
>   (I'm somewhat tempted to cut my losses short (much too late) and
> abandon UIMA flow control altogether, using only simple pipelines and
> having custom glue code to connect these together, as it seems like
> getting the flow to work in interesting cases is a huge time sink and in
> retrospect, it could never pay off any abstract advantage of easier
> distributed processing (where you probably end up having to chop up the
> pipeline manually anyway).  I would probably never recommend new UIMA
> users to strive for a single pipeline with CAS multipliers/mergers and
> begin to consider these features an evolutionary dead end rather than
> advantageous.  Not sure if there even *are* any other real users using
> advanced flows besides me and DeepQA.  I'll be glad to hear any opinions
> on this!)
>
>
A definite advantage of encapsulating analytics in standard UIMA
components is easy scalability via the vertical and horizontal scale-out
options offered by UIMA-AS and DUCC. Flexibility in chopping up a
pipeline into services as needed is another advantage.

The previously mentioned GALE multimodal application also converted
sequences of N input CASes to M output CASes. In that case the input
CASes represented 2 minutes' worth of speech-to-text transcription of
broadcast news, and each output CAS represented a single news story.
The story-CASes then went through a pipeline that identified the story and
updated a pre-existing summarization for each story.

Eddie

--
> Petr Baudis
> If you have good ideas, good data and fast computers,
> you can do almost anything. -- Geoffrey Hinton
>


Re: Error when trying to drop CAS with FlowController

2015-09-06 Thread Eddie Epstein
Hi Richard,

FinalStep() in a CasMultiplier aggregate means to stop further flow
in the aggregate and return the CAS to the component that passed
the CAS into the aggregate, or if a child-CAS, passed the child's
parent-CAS into the aggregate.

FinalStep(true) is used to stop a child-CAS from being returned
to the component. But the contract for an AE is CAS-in/CAS-out,
which means a CAS coming into an AE must be returned.

Eddie

On Sun, Sep 6, 2015 at 9:59 AM, Richard Eckart de Castilho 
wrote:

> Hi Eddie,
>
> ok, but why can input CASes created outside the aggregate not be dropped?
>
> Cheers,
>
> -- Richard
>
> On 06.09.2015, at 15:58, Eddie Epstein  wrote:
>
> > Hi Torsten,
> >
> > The documentation says ...
> >
> > public FinalStep(boolean aForceCasToBeDropped)
> >
> >   Creates a new FinalStep, and may indicate that a CAS should be dropped.
> >   This can only be used for CASes that are produced internally to the
> > aggregate.
> >   It is an error to attempt to drop a CAS that was passed as input to the
> > aggregate.
> >
> > The error must be because the drop is being applied to a CAS passed into
> the
> > aggregate from the outside, not created by a CasMultiplier inside the
> > aggregate.
> >
> > Eddie
> >
> >
> > On Wed, Sep 2, 2015 at 4:22 PM, Zesch, Torsten  >
> > wrote:
> >
> >> Hi all,
> >>
> >> I'm trying to implement a FlowController that drops CASes matching
> certain
> >> criteria. The FlowController is defined on an inner AAE which sets
> >> casproduced to true. The inner AAE resides in an outer AAE which
> contains
> >> additional processing before and after the inner AAE.
> >>
> >> Reader -> Outer AAE { Proc... Inner AAE { FlowController } Proc... Consumer}
> >> The aggregate receives various input CASes and is supposed to drop some
> >> but not others. When I try to drop a CAS in my FlowController now, I get
> >> the error
> >>
> >> Caused by:
> org.apache.uima.analysis_engine.AnalysisEngineProcessException:
> >> The FlowController attempted to drop a CAS that was passed as input to
> the
> >> Aggregate AnalysisEngine containing that FlowController.  The only CASes
> >> that may be dropped are those that are created within the same Aggregate
> >> AnalysisEngine as the FlowController.
> >>
> >> How can I drop CASes using a FlowController such that they do not
> proceed
> >> in the outer aggregate?
> >>
> >>
> >> thanks,
> >> Torsten
>
>


Re: Error when trying to drop CAS with FlowController

2015-09-06 Thread Eddie Epstein
Hi Torsten,

The documentation says ...

public FinalStep(boolean aForceCasToBeDropped)

   Creates a new FinalStep, and may indicate that a CAS should be dropped.
   This can only be used for CASes that are produced internally to the
aggregate.
   It is an error to attempt to drop a CAS that was passed as input to the
aggregate.

The error must be because the drop is being applied to a CAS passed into the
aggregate from the outside, not created by a CasMultiplier inside the
aggregate.

Eddie


On Wed, Sep 2, 2015 at 4:22 PM, Zesch, Torsten 
wrote:

> Hi all,
>
> I'm trying to implement a FlowController that drops CASes matching certain
> criteria. The FlowController is defined on an inner AAE which sets
> casproduced to true. The inner AAE resides in an outer AAE which contains
> additional processing before and after the inner AAE.
>
> Reader -> Outer AAE { Proc... Inner AAE { FlowController } Proc... Consumer}
> The aggregate receives various input CASes and is supposed to drop some
> but not others. When I try to drop a CAS in my FlowController now, I get
> the error
>
> Caused by: org.apache.uima.analysis_engine.AnalysisEngineProcessException:
> The FlowController attempted to drop a CAS that was passed as input to the
> Aggregate AnalysisEngine containing that FlowController.  The only CASes
> that may be dropped are those that are created within the same Aggregate
> AnalysisEngine as the FlowController.
>
> How can I drop CASes using a FlowController such that they do not proceed
> in the outer aggregate?
>
>
> thanks,
> Torsten
>
>


Re: DUCC multi-node installation. Beginner's questions.

2015-07-23 Thread Eddie Epstein
Hi Sergii,

Thanks much for the suggested documentation change.
DUCC v2.0 is getting close to release and now will have this info.

Eddie

On Thu, Jul 23, 2015 at 3:03 AM, Sergii Poluektov <
sergii.poluek...@googlemail.com> wrote:

> Hi Eddie,
>
> thanks a lot for your promt reply.
> Yesterday I finally managed to set up my nodes.
>
> Maybe it is not a bad idea to edit the documentation a bit, so that it is
> explicitly written that DUCC has to be installed once on one node, and
> then
> the installation directory has to be mounted on the same spot on all the
> other
> nodes.
> Maybe it will save someone like me a couple of hours.
>
> Thanks again and cheers,
> Sergii
>
> On Wed, Jul 22, 2015 at 3:02 PM, Eddie Epstein 
> wrote:
>
> > Hi Sergii,
> >
> > The ducc_runtime tree needs to be installed on a shared filesystem
> > that all DUCC nodes have mounted in the same location. Just install
> > the ducc runtime once from the DUCC head node. All other DUCC
> > nodes simply need to have the mounted filesystem and common user
> > accounts with identical user and group IDs.
> >
> > An NFS shared filesystem as you describe would be fine, and the
> > same filesystem could be used for providing shared user space,
> > typically the users home directories.
> >
> > Eddie
> >
>


Re: DUCC multi-node installation. Beginner's questions.

2015-07-22 Thread Eddie Epstein
Hi Sergii,

The ducc_runtime tree needs to be installed on a shared filesystem
that all DUCC nodes have mounted in the same location. Just install
the ducc runtime once from the DUCC head node. All other DUCC
nodes simply need to have the mounted filesystem and common user
accounts with identical user and group IDs.

An NFS shared filesystem as you describe would be fine, and the
same filesystem could be used for providing shared user space,
typically the users home directories.

Eddie


Re: UIMAj3 ideas

2015-07-10 Thread Eddie Epstein
Hi Petr,

Good comments which will likely generate lots of responses.
For now please see comments on scaleout below.

On Thu, Jul 9, 2015 at 6:52 PM, Petr Baudis  wrote:

>   * UIMAfit is not part of core UIMA and UIMA-AS is not part of core
> UIMA.  It seems to me that UIMA-AS is doing things a bit differently
> than what the original UIMA idea of doing scaleout was.  The two
> things don't play well together.  I'd love a way to easily take
> my plain UIMA pipeline and scale it out, ideally without any code
> changes, *and* avoid the terrible XML config files.
>
>
Not clear what you are referring to as the "original UIMA idea of doing
scaleout", the CPE? Core UIMA is a single-threaded, embeddable framework.
UIMA-AS is also an embeddable framework that offers flexible vertical
(multi-threading) and horizontal (multi-process) options for deploying an
arbitrary pipeline. Admittedly, scaleout with UIMA-AS is complicated, and the
minimal support for process management makes it difficult to do scaleout
simply. In what ways do you think UIMA-AS is inconsistent with UIMA or UIMA
scaleout?

DUCC is a full cluster management application that will scale out a plain
UIMA pipeline with no code changes, assuming that the application code is
thread-safe. But a typical pipeline with a single collection reader creating
input CASes and a single CAS consumer will limit scaleout performance pretty
quickly. DUCC makes it easy to eliminate the input data bottleneck. The DUCC
sample apps show one approach to eliminating the output bottleneck. Have you
looked at DUCC?

Regards,
Eddie


Re: Multi-threaded UIMA ParallelStep

2015-05-20 Thread Eddie Epstein
Right about the flow controller. That's where UIMA-AS comes in. Assuming
that the CM has a casPool with enough CASes, and the aggregate is deployed
asynchronously, then each delegate will be running in its own thread and
can be processing CASes in parallel.

The ASB is a single-threaded controller used for deployment of synchronous
aggregates.

Is the intention here to use parallel processing to reduce latency for an
interactive application, or to increase throughput for batch processing? For
throughput, why not just deploy the entire pipeline single-threaded and
then run multiple pipeline instances in separate threads? UIMA-AS would do
this by specifying N instances of a synchronous top-level aggregate.
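
For the throughput case, a minimal sketch of "N pipeline instances in N
threads" with plain core UIMA (the descriptor path, thread count and the
fetchWork() input source are placeholders):

    import org.apache.uima.UIMAFramework;
    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.resource.ResourceSpecifier;
    import org.apache.uima.util.XMLInputSource;

    public class ThreadedPipelines {
      public static void main(String[] args) throws Exception {
        ResourceSpecifier spec = UIMAFramework.getXMLParser()
            .parseResourceSpecifier(new XMLInputSource("MyAggregate.xml"));

        int nThreads = 4; // pick based on the number of available cores
        Thread[] threads = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
          // Each thread gets its own AE instance and CAS, so only shared
          // resources (models, dictionaries) need to be thread-safe.
          final AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(spec);
          threads[i] = new Thread(() -> {
            try {
              CAS cas = ae.newCAS();
              for (String doc : fetchWork()) {
                cas.setDocumentText(doc);
                ae.process(cas);
                cas.reset();
              }
            } catch (Exception e) {
              e.printStackTrace();
            }
          });
          threads[i].start();
        }
        for (Thread t : threads) {
          t.join();
        }
      }

      // Stand-in for whatever supplies documents to a given thread.
      private static java.util.List<String> fetchWork() {
        return java.util.Collections.emptyList();
      }
    }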

Eddie


On Wed, May 20, 2015 at 8:49 AM, Petr Baudis  wrote:

>   Hi!
>
> On Wed, May 20, 2015 at 07:56:33AM -0400, Eddie Epstein wrote:
> > Parallel-step currently only works with remote delegates. The other
> > approach, using CasMultipliers, allows an arbitrary amount of parallel
> > processing in-process. A CM would create a separate CAS for each delegate
> > intended to run in parallel, and use a feature structure to hold a unique
> > identifier in each child CAS which a custom flow controller would use to
> > direct these CASes to the desired delegates. Results for the parallel
> flows
> > could be merged in a CasConsumer back into the parent CAS or to some
> other
> > output.
>
>   Thanks for that hint.  However, I'm not sure how a flow controller
> could direct CASes to delegates?  As far as I understand it, the flow
> controller decides which AE processes the CAS next, but cannot control
> the actual parallel execution of the flow, which would need to be taken
> care by the ASB (Analysis Structure Broker), and that would be the thing
> to hack in this case.  Am I missing something?
>
>   Thanks,
>
> Petr Baudis
>


Re: Multi-threaded UIMA ParallelStep

2015-05-20 Thread Eddie Epstein
Parallel-step currently only works with remote delegates. The other
approach, using CasMultipliers, allows an arbitrary amount of parallel
processing in-process. A CM would create a separate CAS for each delegate
intended to run in parallel, and use a feature structure to hold a unique
identifier in each child CAS which a custom flow controller would use to
direct these CASes to the desired delegates. Results for the parallel flows
could be merged in a CasConsumer back into the parent CAS or to some other
output.

Some other key concepts here are the CasCopier, which can be used to
efficiently copy large amounts of CAS content from one CAS to another, and
"process-parent-last" which can be specified for a CasMultiplier so that
further processing of a parent CAS will not continue until all of its
children have completed processing.
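
A bare-bones sketch of such a CasMultiplier (the "example.BranchId" type and
its "key" feature are hypothetical and would be declared in the type system;
the delegate keys are illustrative):

    import org.apache.uima.analysis_component.JCasMultiplier_ImplBase;
    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.cas.AbstractCas;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.Feature;
    import org.apache.uima.cas.FeatureStructure;
    import org.apache.uima.cas.Type;
    import org.apache.uima.jcas.JCas;
    import org.apache.uima.util.CasCopier;

    public class BranchingCasMultiplier extends JCasMultiplier_ImplBase {

      private static final String[] BRANCHES = { "A1", "A2", "A3" };

      private JCas parent;
      private int emitted;

      @Override
      public void process(JCas jcas) throws AnalysisEngineProcessException {
        parent = jcas;
        emitted = 0;
      }

      @Override
      public boolean hasNext() throws AnalysisEngineProcessException {
        return emitted < BRANCHES.length;
      }

      @Override
      public AbstractCas next() throws AnalysisEngineProcessException {
        JCas child = getEmptyJCas();
        CAS childCas = child.getCas();
        // Copy the parent's content so each branch works on its own CAS.
        CasCopier.copyCas(parent.getCas(), childCas, true);
        // Record which branch this child is for; a custom flow controller
        // reads this value and routes the child to the matching delegate.
        Type idType = childCas.getTypeSystem().getType("example.BranchId");
        Feature keyFeat = idType.getFeatureByBaseName("key");
        FeatureStructure fs = childCas.createFS(idType);
        fs.setStringValue(keyFeat, BRANCHES[emitted]);
        childCas.addFsToIndexes(fs);
        emitted++;
        return child;
      }
    }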

Eddie


On Tue, May 19, 2015 at 9:27 PM, Petr Baudis  wrote:

>   Hi!
>
>   I'm looking into ways to run a part of my pipeline multi-threaded:
>
> .-> Multip0 -> A1 -> Multip1 -> A2 ->.
>   reader -> A0 <  > CASmerger
> `-> Multip2 -> A3 > A2 ->'
> ^^
> ParallelStep is generated for each branch
> in a custom flow controller
>
> Basically, I need a way to tell UIMA to run each ParallelStep (which
> normally just denotes the CAS flow) truly in parallel.  I have two
> constraints:
>
>   (i) I'm using UIMAfit heavily, and multiple CAS multipliers and
> mergers (even within the parallel branches).  So I can't use CPE.
>
>   (ii) I need multi-threading, not separate processes.  (I have just
> a meager 24G RAM (sigh) and one Java process with all the linguistic
> models and stuff loaded takes 3GB RAM.  So I really need to load these
> resources to memory only once.)
>
>
>   I looked into UIMA-AS, including Richard's helpful DKpro-lab code
> sample, but I can't figure out how to make it reasonably work with
> a *complex* UIMAfit pipeline that spans many branches and many
> analysis engines - it seems to me that I would need some centralized
> places where to specify it, and basically completely rewrite my pipeline
> building code (to the worse, in my impression).
>
>   ...and I'm not even sure, from reading UIMA-AS code, if I could make
> it run in multiple threads within a single process!  From comments in
>
>
> org/apache/uima/aae/controller/AggregateAnalysisEngineController_impl.java:parallelStep()
>
> I'm getting an impression that non-remote AEs will be executed serially
> after all, not in parallel.  Is that correct?
>
>
>   So going back to the original UIMA code, it seems to me that the thing
> to do would be replacing ASB_impl with my own copy (inheritance would
> not cut it the way it's coded), AggregateAnalysisEngine_impl with my own
> specialization or copy (as ASB_impl usage is hardcoded there) and
> rewrite the while() loop in ParallelStep case of ASB's
> processUntilNextOutputCas() to run in parallel.  And hope I didn't miss
> any catch...
>
>
>   Is there an option I'm missing?  Any hints would be really
> appreciated!
>
>   Thanks,
>
> Petr Baudis
>


Re: DUCC- process_dd

2015-05-01 Thread Eddie Epstein
Reshu,

UIMA-AS configurations are normally used in DUCC as Services for
interactive applications or to support Jobs. They can be used in Jobs, but
typically are not.

There is also a difference in the inputs between Job processes and
Services. Services will normally receive a CAS with the artifact to be
analyzed. A Job process will receive a CAS with a reference to the artifact
or even to a collection of artifacts; this is important for Job scale-out to
avoid making the Job's Collection Reader a bottleneck.
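
A sketch of the difference on the Job side (the directory, class name and
the trick of putting the path in the document text are all just
illustrative; a real job would more likely carry the reference in a
dedicated feature structure):

    import java.io.File;
    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.collection.CollectionException;
    import org.apache.uima.collection.CollectionReader_ImplBase;
    import org.apache.uima.util.Progress;

    // Each work-item CAS carries only a reference; the job process opens the
    // file and iterates over its contents itself.
    public class ReferencePassingReader extends CollectionReader_ImplBase {

      private List<File> files;
      private int index;

      @Override
      public void initialize() {
        files = Arrays.asList(new File("/data/input").listFiles());
        index = 0;
      }

      @Override
      public boolean hasNext() {
        return index < files.size();
      }

      @Override
      public void getNext(CAS cas) throws IOException, CollectionException {
        cas.setDocumentText(files.get(index++).getAbsolutePath()); // reference only
      }

      @Override
      public Progress[] getProgress() {
        return new Progress[0];
      }

      @Override
      public void close() {
      }
    }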

I suggest starting with one of the sample applications and adapting it to
your needs. We can help if you give some details about the format of the
input and output data.

Eddie

On Fri, May 1, 2015 at 12:31 AM, reshu.agarwal 
wrote:

> Eddie,
>
> I was using this same scenario and doing hit and try to compare this with
> UIMA AS to get the more scaled pipeline as I think UIMA AS can also did
> this. But I am unable to touch the processing time of DUCC's default
> configuration like you mentioned with UIMA AS.
>
> Can you help me in doing this? I just want to do scaling by using best
> configuration of UIMA AS and DUCC which can be done using process_dd. But
> How??
>
> Thanks in advanced.
>
> Reshu.
>
>
> On 05/01/2015 03:28 AM, Eddie Epstein wrote:
>
>> The simplest way of vertically scaling a Job process is to specify the
>> analysis pipeline using core UIMA descriptors and then using
>> --process_thread_count to specify how many copies of the pipeline to
>> deploy, each in a different thread. No use of UIMA-AS at all. Please check
>> out the "Raw Text Processing" sample application that comes with DUCC.
>>
>> On Wed, Apr 29, 2015 at 12:30 AM, reshu.agarwal > >
>> wrote:
>>
>>  Ohh!!! I misunderstand this. I thought this would scale my Aggregate and
>>> AEs both.
>>>
>>> I want to scale aggregate as well as individual AEs. Is there any way of
>>> doing this in UIMA AS/DUCC?
>>>
>>>
>>>
>>> On 04/28/2015 07:14 PM, Jaroslaw Cwiklik wrote:
>>>
>>>  In async aggregate you scale individual AEs not the aggregate as a
>>>> whole.
>>>> The below configuration should do that. Are there any warnings from
>>>> dd2spring at startup with your configuration?
>>>>
>>>> 
>>>>
>>>>   
>>>>   >>> key="ChunkerDescriptor">
>>>>   >>> numberOfInstances="5" />
>>>>   
>>>>   >>> key="NEDescriptor">
>>>>   >>> numberOfInstances="5" />
>>>>   
>>>>   >>> key="StemmerDescriptor">
>>>>   >>> numberOfInstances="5" />
>>>>   
>>>>   >>> key="ConsumerDescriptor">
>>>>   >>> numberOfInstances="5" />
>>>>   
>>>>   
>>>>   
>>>>
>>>> Jerry
>>>>
>>>> On Tue, Apr 28, 2015 at 5:20 AM, reshu.agarwal <
>>>> reshu.agar...@orkash.com>
>>>> wrote:
>>>>
>>>>   Hi,
>>>>
>>>>> I was trying to scale my processing pipeline to be run in DUCC
>>>>> environment
>>>>> with uima as process_dd. If I was trying to scale using the below given
>>>>> configuration, the threads started were not as expected:
>>>>>
>>>>>
>>>>> >>>>   xmlns="http://uima.apache.org/resourceSpecifier";>
>>>>>
>>>>>   Uima v3 Deployment Descripter
>>>>>   Deploys Uima v3 Aggregate AE using the Advanced
>>>>> Fixed
>>>>> Flow
>>>>>   Controller
>>>>>
>>>>>   
>>>>>   
>>>>>   
>>>>>   >>>

Re: DUCC- process_dd

2015-04-30 Thread Eddie Epstein
The simplest way of vertically scaling a Job process is to specify the
analysis pipeline using core UIMA descriptors and then using
--process_thread_count to specify how many copies of the pipeline to
deploy, each in a different thread. No use of UIMA-AS at all. Please check
out the "Raw Text Processing" sample application that comes with DUCC.

On Wed, Apr 29, 2015 at 12:30 AM, reshu.agarwal 
wrote:

>
> Ohh!!! I misunderstand this. I thought this would scale my Aggregate and
> AEs both.
>
> I want to scale aggregate as well as individual AEs. Is there any way of
> doing this in UIMA AS/DUCC?
>
>
>
> On 04/28/2015 07:14 PM, Jaroslaw Cwiklik wrote:
>
>> In async aggregate you scale individual AEs not the aggregate as a whole.
>> The below configuration should do that. Are there any warnings from
>> dd2spring at startup with your configuration?
>>
>> 
>>
>>  
>>  > key="ChunkerDescriptor">
>>  > numberOfInstances="5" />
>>  
>>  > key="NEDescriptor">
>>  > numberOfInstances="5" />
>>  
>>  > key="StemmerDescriptor">
>>  > numberOfInstances="5" />
>>  
>>  > key="ConsumerDescriptor">
>>  > numberOfInstances="5" />
>>  
>>  
>>  
>>
>> Jerry
>>
>> On Tue, Apr 28, 2015 at 5:20 AM, reshu.agarwal 
>> wrote:
>>
>>  Hi,
>>>
>>> I was trying to scale my processing pipeline to be run in DUCC
>>> environment
>>> with uima as process_dd. If I was trying to scale using the below given
>>> configuration, the threads started were not as expected:
>>>
>>>
>>> >>  xmlns="http://uima.apache.org/resourceSpecifier";>
>>>
>>>  Uima v3 Deployment Descripter
>>>  Deploys Uima v3 Aggregate AE using the Advanced
>>> Fixed
>>> Flow
>>>  Controller
>>>
>>>  
>>>  
>>>  
>>>  >> brokerURL="tcp://localhost:61617?jms.useCompression=true" prefetch="0" />
>>>  
>>>  >>
>>> location="../Uima_v3_test/desc/orkash/ae/aggregate/FlowController_Uima.xml"
>>> />
>>>  
>>>  >> key="FlowControllerAgg" internalReplyQueueScaleout="10"
>>> inputQueueScaleout="10">
>>>  
>>>  
>>>  >> key="ChunkerDescriptor">
>>>  >> numberOfInstances="5" />
>>>  
>>>  >> key="NEDescriptor">
>>>  >> numberOfInstances="5" />
>>>  
>>>  >> key="StemmerDescriptor">
>>>  >> numberOfInstances="5" />
>>>  
>>>  >> key="ConsumerDescriptor">
>>>  >> numberOfInstances="5" />
>>>  
>>>  
>>>  
>>>  
>>>  
>>>
>>> 
>>>
>>>
>>> There should be 5 threads of FlowControllerAgg where each thread will
>>> have
>>> 5 more threads of each ChunkerDescriptor,NEDescriptor,StemmerDescriptor
>>> and
>>> ConsumerDescriptor.
>>>
>>> But I didn't think it is actually happening in case of DUCC.
>>>
>>> Thanks in advance.
>>>
>>> Reshu.
>>>
>>>
>>>
>>>
>


Re: UIMA-AS and ActiveMQ ports

2015-04-27 Thread Eddie Epstein
UIMA-AS has example deployment descriptors using placeholders for the
broker: ${defaultBrokerURL}.
If these placeholders are used and the user doesn't specify a value for the
Java property "defaultBrokerURL", then some code in UIMA-AS will use a
default value of tcp://localhost:61616. That is the only dependency I am
aware of.
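
So, for example, setting the property before the UIMA-AS client or service
reads its deployment descriptor (the broker host and port below are
placeholders) overrides that default:

    // Equivalent to passing -DdefaultBrokerURL=... on the java command line.
    System.setProperty("defaultBrokerURL", "tcp://my-broker-host:61617");
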
Eddie

On Mon, Apr 27, 2015 at 4:47 PM, D. Heinze  wrote:

> Does UIMA-AS have internal dependencies on ActiveMQ port 61616?  I can
> change my applications to use other ports, but it seems that 61616 still
> needs to be available for something in UIMA-AS.
>
> Thanks / Dan
>
>
>


Re: Error handling in flow control

2015-04-26 Thread Eddie Epstein
Very clear, thanks. A CasMultiplier has the ability to deserialize a CAS
from a file and emit it as a child CAS. A parent CAS could have a
FeatureStructure identifying it as one to be rerun from some specific state
(CAS file); the CM would trigger on that FS and produce the child CAS to be
reprocessed; the flow controller would be configured to return the child
from the aggregate; and the client would then use the child and ignore the
parent.

An ideal threading solution would be to use UIMA-AS. Unfortunately, a
UIMA-AS service currently requires an AMQ broker for service input and
output. It is possible to embed both broker and service in the same process,
but that adds complexity and serialization overhead.

Another thing to consider is the relatively new binary compressed
CAS form 6, which can save considerable space over a zip-compressed XmiCas.
Form 6 has the same ability as XmiCas to be deserialized into a CAS with a
different but compatible type system.
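
A sketch of saving and reloading a snapshot that way (the method names on
the org.apache.uima.cas.impl.Serialization facade are from memory and worth
double-checking against the javadocs; the file path is a placeholder):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.impl.Serialization;

    public class Form6Snapshot {

      // Save a snapshot in compressed binary form 6. Passing the CAS's own
      // type system keeps everything; a smaller, compatible type system could
      // be passed to filter what gets written.
      public static void save(CAS cas, String path) throws Exception {
        try (FileOutputStream out = new FileOutputStream(path)) {
          Serialization.serializeWithCompression(cas, out, cas.getTypeSystem());
        }
      }

      // Load the snapshot back into a fresh CAS with a compatible type system.
      public static void load(CAS emptyCas, String path) throws Exception {
        try (FileInputStream in = new FileInputStream(path)) {
          Serialization.deserializeCAS(emptyCas, in);
        }
      }
    }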

Hope this helps,
Eddie


On Sat, Apr 25, 2015 at 2:58 AM, Mario Gazzo  wrote:

> My apologies for not being very clear.
>
> I managed to get the basic flow control to work after modifying some AE to
> check for a previous installed sofa before just adding another.
>
> The services I mentioned are not UIMA related but we are migrating
> existing text analysis components to UIMA and these need to integrate with
> a larger existing setup that rely on various AWS services such as S3,
> DynamoDB, Simple Workflow and EMR. We don’t have as such plans to use
> UIMA-AS or Vinci but instead we already use AWS Simple Workflow (SWF) to
> orchestrate all our workers. This means that we just wanted to run multiple
> UIMA pipelines inside some of these workers using multithreaded CPE. I am
> now trying to implement this integration by consuming activity tasks from
> SWF through a collection reader and then have a flow control manage the
> logic and respond back when the AAE pipeline has completed or failed. This
> is where I had problems when experimenting with failure handling.
>
> We are storing output from these workers on S3 and in DynamoDB tables for
> use further downstream in our workflow and online applications. We also
> store intermediate results (snapshots) on S3 so that we can at any point go
> back to a previous step and resume, retry or redo processing but it also
> allows us to inspect data for debugging/analysis purposes. I thought that I
> might be able to do something similar within the CPE using the CAS but this
> isn't that simple. E.g. running the same AE twice against the same CAS
> would result in those annotations occurring twice without carefully
> designing around this. I can still serialize snapshot CAS to XMI on S3 but
> I can’t just load them again in order to restore them back to a previous
> state within the same CPE flow. Instead I would have to fail and initiate a
> retry through SWF, which would cause the previous state to be loaded from
> S3 into a new CAS via the next worker that receives the retry activity task
> through its collection reader. However, storing many snapshot CAS outputs
> will even compressed take a lot more space than the format we are using in
> our production setup now, so I am considering whether there are alternative
> approaches but they so far all appear much more complex and brittle.
>
> Indeed CAS multipliers would be useful for us but the limitations of the
> CPE and the general difficulties I have experienced so far have made me
> consider implementing a custom multithreaded collection processor but I
> wanted to avoid this.
>
> Hope this clarifies what I am trying to do. Cheers :)
>
> > On 24 Apr 2015, at 16:50 , Eddie Epstein  wrote:
> >
> > Can you give more details on the overall pipeline deployment? The initial
> > description mentions a CPE and it mentions services. The CPE was created
> > before flow controllers or CasMultipliers existed and has no support for
> > them. Services could be Vinci services for the CPE or UIMA-AS services or
> > ???
> >
> > On Fri, Apr 24, 2015 at 5:37 AM, Mario Gazzo 
> wrote:
> >
> >> I am trying to get error handling to work with a custom flow control. I
> >> need to send status information back to a service after the flow
> completed
> >> either with or without errors but I can only do this once for any
> workflow
> >> item because it changes the state of the job, at least without error
> >> replies and wasteful requests. The problem is that I need to do several
> >> retries before finally failing and reporting the status to a service.
> First
> >> I tried to let the CPE do the retry for me by setting the max error
> count
> >> but then a new flow object is created every time and I loose track of
> the

Re: Error handling in flow control

2015-04-24 Thread Eddie Epstein
Can you give more details on the overall pipeline deployment? The initial
description mentions a CPE and it mentions services. The CPE was created
before flow controllers or CasMultipliers existed and has no support for
them. Services could be Vinci services for the CPE, or UIMA-AS services, or
???

On Fri, Apr 24, 2015 at 5:37 AM, Mario Gazzo  wrote:

> I am trying to get error handling to work with a custom flow control. I
> need to send status information back to a service after the flow completed
> either with or without errors but I can only do this once for any workflow
> item because it changes the state of the job, at least without error
> replies and wasteful requests. The problem is that I need to do several
> retries before finally failing and reporting the status to a service. First
> I tried to let the CPE do the retry for me by setting the max error count
> but then a new flow object is created every time and I loose track of the
> number of retries before this. This means that I don’t know when to report
> the status to the service because it should only happen after the final
> retry.
>
> I then tried to let the flow instance manage the retries by moving back to
> the previous step again but then I get the error
> “org.apache.uima.cas.CASRuntimeException: Data for Sofa feature
> setLocalSofaData() has already been set”, which is because the document
> text is set in this particular test case. I then also tried to reset the
> CAS completely before retrying the pipeline from scratch and this of course
> throws the error “CASAdminException: Can't flush CAS, flushing is
> disabled.”. It would be less wasteful if only the failed step is retried
> instead of the whole pipeline but this requires clean up, which in some
> cases might be impossible. It appears that managing errors can be rather
> complex because the CAS can be in an unknown state and an analysis engine
> operation is not idempotent. I probably need to start the whole pipeline
> from the start if I want more than a single attempt, which gets me back to
> the problem of tracking the number of attempts before reporting back to the
> service.
>
> Does anyone have any good suggestion on how to do this in UIMA e.g.
> passing state information from a failed flow to the next flow attempt?
>
>


Re: UIMA CPE appears not to utilise more than a single thread

2015-04-13 Thread Eddie Epstein
The CPE runs pipeline threads in parallel, not necessarily CAS processors.
In a CPE descriptor, generally all non-CasConsumer components make up the
pipeline.

Change the following line to indicate how many pipeline threads to run, and
make sure the casPoolSize is the number of threads + 2.



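In a CPE descriptor this is the processingUnitThreadCount attribute on the
casProcessors element (alongside casPoolSize). Since a copy of the uimaFIT
CpeBuilder is being used here anyway, the same thing can also be set
programmatically; a sketch, with method names as I recall them from uimaFIT
and therefore worth verifying:

    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import org.apache.uima.collection.CollectionProcessingEngine;
    import org.apache.uima.collection.CollectionReaderDescription;
    import org.apache.uima.fit.cpe.CpeBuilder;

    public class ThreadedCpe {
      public static void run(CollectionReaderDescription reader,
                             AnalysisEngineDescription aggregate) throws Exception {
        CpeBuilder builder = new CpeBuilder();
        builder.setReader(reader);
        builder.setAnalysisEngine(aggregate);
        // Number of parallel pipeline threads; the builder sizes the CAS pool
        // to match.
        builder.setMaxProcessingUnitThreadCount(4);
        CollectionProcessingEngine cpe = builder.createCpe(null);
        cpe.process();
      }
    }
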
Eddie

On Mon, Apr 13, 2015 at 7:44 AM, Mario Gazzo  wrote:

> It appears that I can only utilise a single CAS processor even if I
> specify many more. I am not sure what I am doing wrong but I think I must
> be missing something important in my configuration.
>
> We only need multithreading and not the distributed features of UIMA CPE
> or similar. I copied and modified the UIMA FIT CpePipeline and CpeBuilder
> to do this and I only altered thread counts and error handling since I want
> the CAS just to be dropped on exceptions. I have verified that the accurate
> number of CAS processors are created using the debugger and I can in
> JConsole see that an equivalent amount of active threads are created but
> only one thread seems to be fed from my simple custom collection reader,
> which in the simple test setup only reads input entries from a file. I can
> see this because I log the thread id inside the AEs, which is always the
> same. I have also verified that the CAS pool size equals the number of
> processors + 2.
>
> Is there some additional collection reader configuration required to feed
> all the other CAS processors?
>
>
>
>
>
>


Re: Ducc Problems

2015-03-03 Thread Eddie Epstein
nnel
>>> stopChannel
>>> INFO: Controller: ducc.jd.queue.13202 Stopped Listener on Endpoint:
>>> queue://ducc.jd.queue.13202 Selector:  Selector:Command=2000 OR
>>> Command=2002.
>>> Feb 19, 2015 5:39:56 PM org.apache.uima.adapter.jms.
>>> activemq.JmsInputChannel
>>> stopChannel
>>> INFO: Stopping Service JMS Transport. Service: ducc.jd.queue.13202
>>> ShutdownNow false
>>> Feb 19, 2015 5:39:56 PM org.apache.uima.adapter.jms.
>>> activemq.JmsInputChannel
>>> stopChannel
>>> INFO: Controller: ducc.jd.queue.13202 Stopped Listener on Endpoint:
>>> queue://ducc.jd.queue.13202 Selector:  Selector:Command=2001.
>>> Feb 19, 2015 5:39:56 PM org.apache.uima.adapter.jms.
>>> activemq.JmsInputChannel
>>> stopChannel
>>> INFO: Stopping Service JMS Transport. Service: ducc.jd.queue.13202
>>> ShutdownNow true
>>> Feb 19, 2015 5:39:56 PM org.apache.uima.adapter.jms.
>>> activemq.JmsInputChannel
>>> stopChannel
>>> INFO: Controller: ducc.jd.queue.13202 Stopped Listener on Endpoint:
>>> queue://ducc.jd.queue.13202 Selector:  Selector:Command=2000 OR
>>> Command=2002.
>>> Feb 19, 2015 5:39:56 PM org.apache.uima.adapter.jms.
>>> activemq.JmsInputChannel
>>> stopChannel
>>> INFO: Stopping Service JMS Transport. Service: ducc.jd.queue.13202
>>> ShutdownNow true
>>> Feb 19, 2015 5:39:56 PM org.apache.uima.adapter.jms.
>>> activemq.JmsInputChannel
>>> stopChannel
>>> INFO: Controller: ducc.jd.queue.13202 Stopped Listener on Endpoint:
>>> queue://ducc.jd.queue.13202 Selector:  Selector:Command=2001.
>>> UIMA-AS Service is Stopping, All CASes Have Been Processed
>>> Feb 19, 2015 5:39:56 PM org.apache.uima.aae.controller.
>>> PrimitiveAnalysisEngineController_impl stop
>>> INFO: Stopping Controller: ducc.jd.queue.13202
>>> Feb 19, 2015 5:39:56 PM org.apache.uima.adapter.jms.
>>> activemq.JmsInputChannel
>>> stopChannel
>>> INFO: Stopping Service JMS Transport. Service: ducc.jd.queue.13202
>>> ShutdownNow true
>>> Feb 19, 2015 5:39:56 PM org.apache.uima.adapter.jms.
>>> activemq.JmsInputChannel
>>> stopChannel
>>> INFO: Controller: ducc.jd.queue.13202 Stopped Listener on Endpoint:
>>> queue://ducc.jd.queue.13202 Selector:  Selector:Command=2000 OR
>>> Command=2002.
>>> Feb 19, 2015 5:39:56 PM org.apache.uima.adapter.jms.
>>> activemq.JmsInputChannel
>>> stopChannel
>>> INFO: Stopping Service JMS Transport. Service: ducc.jd.queue.13202
>>> ShutdownNow true
>>> Feb 19, 2015 5:39:56 PM org.apache.uima.adapter.jms.
>>> activemq.JmsInputChannel
>>> stopChannel
>>> INFO: Controller: ducc.jd.queue.13202 Stopped Listener on Endpoint:
>>> queue://ducc.jd.queue.13202 Selector:  Selector:Command=2001.
>>> Feb 19, 2015 5:39:56 PM org.apache.uima.adapter.jms.
>>> activemq.JmsOutputChannel
>>> stop
>>> INFO: Controller: ducc.jd.queue.13202 Output Channel Shutdown Completed
>>>
>>> Thanks Reshu.
>>>
>>>
>>>
>>> On 02/20/2015 12:40 AM, Jaroslaw Cwiklik wrote:
>>>
>>>  One possible explanation for destroy() not getting called is that a
>>>> process
>>>> (JP) may be still working on a CAS when Ducc deallocates the process.
>>>> Ducc
>>>> first asks the process to quiesce and stop and allows it 1 minute to
>>>> terminate on its own. If this does not happen, Ducc kills the process
>>>> via
>>>> kill -9. In such case the process will be clobbered and destroy()
>>>> methods
>>>> in UIMA-AS are not called.
>>>> There should be some evidence in JP logs at the very end. Look for
>>>> something like this:
>>>>
>>>>   Process Received a Message. Is Process target for message:true.
>>>>
>>>>> Target PID:27520
>>>>>>>>>>>>
>>>>>>>>>>> configFactory.stop() - stopped
>>>>>
>>>>>> route:mina:tcp://localhost:49338?transferExchange=true&sync=false
>>>>>>
>>>>> 01:56:22.735 - 94:
>>>> org.apache.uima.aae.controller.PrimitiveAnalysisEngineControl
>>>> ler_impl.quiesceAndStop:
>>>> INFO: Stopping Controller: ducc.jd.queue.226091
>>>> Quiescing UIMA-AS Service. Remaining Number of CASes to Process:0
>>>>
>>>> Look at the timestamp of >

Re: Ruta parallel execution

2014-12-19 Thread Eddie Epstein
Hi Silvestre,

An aggregate deployed with UIMA-AS can be used to run delegate annotators
in parallel, with a few restrictions.
 - the aggregate must be deployed as async=true
 - the parallel delegates must each be running in remote processes
 - the delegates must not modify preexisting FS

As Jens suggests, the resultant latency improvement depends on the remoting
overhead vs processing time. Latency will also be subject to the latency of
the slowest parallel delegate.

Remoting overhead can be reduced using the binary serialization option, but
then all services must have identical typesystems.

Eddie


On Fri, Dec 19, 2014 at 9:10 AM, Silvestre Losada <
silvestre.los...@gmail.com> wrote:

> Hi Jens,
>
> First of all thanks for your detailed answer. UIMA ruta has an option in
> order to execute an analisys engine from ruta script here
>  is described. So inside the script you can execute
> the analysis engine and then apply some rules to the annotations created by
> the analysis engine. What I want is to have the option to execute the
> analysis engines in parallel to save time. Would it be possible?
>
> Kind regards
>
> On 19 December 2014 at 12:35, Jens Grivolla  wrote:
> >
> > Hi Silvestre,
> >
> > there doesn't seem to be anything RUTA-specific in your question. In
> > principle, UIMA-AS allows parallel scaleout and merges the results
> (though
> > I personally have never used it this way), but there are of course a few
> > things to take into account.
> >
> > First, you will of course need to properly define the dependencies
> between
> > your different analysis engines to ensure you always have all then
> > necessary information available, meaning that you can only run things in
> > parallel that are independent of one another. And then you will have to
> see
> > if the overhead from distributing your CAS to several engines running in
> > parallel and then merging the results is not greater than just having it
> in
> > one colocated pipeline that can pass the information more efficiently. I
> > guess you'll have to benchmark your specific application, but maybe
> > somebody with more experience can give you some general directions...
> >
> > Best,
> > Jens
> >
> > On Thu, Dec 18, 2014 at 12:26 PM, Silvestre Losada <
> > silvestre.los...@gmail.com> wrote:
> > >
> > > Well let me explain.
> > >
> > > Ruta scripts are really good to work over output of analysis engines,
> > each
> > > analysis engine will make some atomic work and using ruta rules you can
> > > easily work over generated annotations combine them, remove them...
> > What I
> > > need is to execute several analysis engines in parallel to improve the
> > > response time, so now the analysis engines are executed sequentially
> and
> > I
> > > want to execute them in parallel, then take the output of all of them
> and
> > > apply some ruta rules to the output.
> > >
> > > would it be possible.
> > >
> > > On 17 December 2014 at 18:13, Peter Klügl 
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > I haven't used UIMA-AS (with ruta) in a real application yet, but I
> > > > tested it once for an rc. Did you face any problems?
> > > >
> > > > Best
> > > >
> > > > Peter
> > > >
> > > > Am 17.12.2014 14:34, schrieb Silvestre Losada:
> > > > > Hi All,
> > > > >
> > > > > Is there any way to execute ruta scripts in parallel, using uima-AS
> > > > >  aproach? in case yes could you provide me an example.
> > > > >
> > > > > Kind regards.
> > > > >
> > > >
> > > >
> > >
> >
>


Re: DUCC- Agent1 is on Physical and Agent2 is on virtual=Slow the job process timing

2014-12-19 Thread Eddie Epstein
Hi Reshu,

On Fri, Dec 19, 2014 at 12:26 AM, reshu.agarwal 
wrote:

>
> Hi,
>
> Is there any problem if one Agent node is on Physical(Master) and one
> agent node is on virtual?
>
> I am running a job which is having avg processing timing of 20 min when I
> have configured a single machine DUCC (physical machine)as well as when
> both nodes were on physical machine only.
>

So the job is running at the same speed on one physical machine as on two
physical machines? Do the two machines have similar CPU performance and
number of cores?


>
> When I have shifted my one agent node to virtual machine avg processing
> timing of Job was increased to 1 Hour. Here I noticed that my job driver
> was also running only on virtual machine's agent node.
>

Hard to diagnose with so little information. Look at the job details
page. The work items tab will show the processing time for each work item
and the machine it ran on. See if the timing is clearly different for work
on the virtual machine. Look at the performance tab and compare the breakdown
between fast and slow jobs: where are the differences?

There are several factors which influence performance. Is the job process
CPU bound, or does it have significant I/O wait time? Is the total number
of processing threads on a machine much more than the number of real CPU
cores on the machine? Average CPU usage for each JP is shown on the
processes tab of each job; 100% = one CPU, 200% = 2 CPUs. How do these
numbers look vs the number of processing threads each JP is running?


>
> Can we run job driver to specific agent node so that I will be able to
> test any other Case Scenario? Because I also tried to run my job's process
> on agent node of physical machine but it didn't reflect the processing time
> much.
>

Normally a job driver is not a bottleneck. It can be a bottleneck if the
driver is sending raw data instead of references to raw data to the JPs. Or
if the JD is running on a machine that is in bad shape, paging, etc. What
is the CPU reported for the JD?


>
> Thanks in advanced.
>
> Reshu.
>


Re: Serializing Specific View to XMI

2014-12-04 Thread Eddie Epstein
I think that is not supported directly. One could use the CasCopier to copy
the view(s) of interest to a new, empty CAS and serialize to an XMI file
from that.
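
A sketch of that (the view name, output path, and helper class name are
placeholders):

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.impl.XmiCasSerializer;
    import org.apache.uima.util.CasCopier;
    import org.apache.uima.util.CasCreationUtils;
    import org.apache.uima.util.TypeSystemUtil;

    public class SingleViewXmiWriter {

      // Copy one view of a multi-view CAS into a fresh CAS and write that to XMI.
      public static void write(CAS multiViewCas, String viewName, String path)
              throws Exception {
        // New, empty CAS with the same type system as the source.
        CAS target = CasCreationUtils.createCas(
            TypeSystemUtil.typeSystem2TypeSystemDescription(multiViewCas.getTypeSystem()),
            null, null);

        // Copy the chosen view, Sofa included, into the target's initial view.
        CasCopier copier = new CasCopier(multiViewCas, target);
        copier.copyCasView(multiViewCas.getView(viewName), target, true);

        try (OutputStream out = new FileOutputStream(path)) {
          XmiCasSerializer.serialize(target, out);
        }
      }
    }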

Eddie

On Wed, Dec 3, 2014 at 9:04 AM, Jakob Sahlström 
wrote:

> Hi,
>
> I'm dealing with a CAS with multiple views, namely a Gold View and a System
> View. I've been using XmiCasSerializer to save CASes but now I would like
> to save only the contents in the System View. The XmiCasSerializer seems to
> save the whole CAS with all views. Is there an easy way of just saving a
> single view to an xmi file?
>
> Best,
>
> Jakob
>


Re: DUCC doesn't use all available machines

2014-11-30 Thread Eddie Epstein
On Sun, Nov 30, 2014 at 11:48 AM, Simon Hafner 
wrote:

> 2014-11-30 7:25 GMT-06:00 Eddie Epstein :
> > On Sat, Nov 29, 2014 at 4:46 PM, Simon Hafner 
> wrote:
> >
> >> I've thrown some numbers at it (doubling each) and it's running at
> >> comfortable 125 procs. However, at about 6.1k of 6.5k items, the procs
> >> drop down to 30.
> >>
> >
> > 125 processes at 8 threads each = 1000 active pipelines. How many CPU cores
> > are these 1000 pipelines running on?
> Only 60.
>

For CPU bound pipelines, throughput will tend to degrade with more threads
than cores.


Re: DUCC doesn't use all available machines

2014-11-30 Thread Eddie Epstein
On Sat, Nov 29, 2014 at 4:46 PM, Simon Hafner  wrote:

> I've thrown some numbers at it (doubling each) and it's running at
> comfortable 125 procs. However, at about 6.1k of 6.5k items, the procs
> drop down to 30.
>

125 processes at 8 threads each = 1000 active pipelines. How many CPU cores
are these 1000 pipelines running on?


Re: DUCC doesn't use all available machines

2014-11-28 Thread Eddie Epstein
Now you are hitting a limit configured in ducc.properties:

  # Max number of work-item CASes for each job
  ducc.threads.limit = 500

62 job processes * 8 threads per process = 496 max concurrent work items.
This was put in to limit the memory required by the job driver. This value
can probably be pushed up into the range of 700-800 before the job driver
will go OOM. There are configuration parameters to increase JD memory:

  # Memory size in MB allocated for each JD
  ducc.jd.share.quantum = 450
  # JD max heap size. Should be smaller than the JD share quantum
  ducc.driver.jvm.args = -Xmx400M -DUimaAsCasTracking

DUCC would have to be restarted for the JD size parameters to take effect.

One of the current DUCC development items is to significantly reduce the
memory needed per work item, and raise the default limit for concurrent
work items by two or three orders of magnitude.



On Fri, Nov 28, 2014 at 6:40 PM, Simon Hafner  wrote:

> I've put the fudge to 12000, and it jumped immediately to 62 procs.
> However, it doesn't spawn new ones even though it has about 6k items
> left and it doesn't spawn more procs.
>
> 2014-11-17 15:30 GMT-06:00 Jim Challenger :
> > It is also possible that RM "prediction" has decided that additional
> > processes are not needed.  It
> > appears that there were likely 64 work items dispatched, plus the 6
> > completed, leaving only
> > 30 that were "idle".  If these work items appeared to be completing
> quickly,
> > the RM would decide
> > that scale-up would be wasteful and not do it.
> >
> > Very gory details if you're interested:
> > The time to start a new processes is measured by the RM based on the
> > observed initialization time of the processes plus an estimate of how
> long
> > it would take to get
> > a new process actually running.  A fudge-factor is added on top of this
> > because in a large operation
> > it is wasteful to start processes (with associated preemptions) that only
> > end up doing a "few" work
> > items.  All is subjective and configurable.
> >
> > The average time-per-work item is also reported to the RM.
> >
> > The RM then looks at the number of work items remaining, and the
> estimated
> > time needed to
> > processes this work based on the above, and if it determines that the job
> > will be completed before
> > new processes can be scaled up and initialized, it does not scale up.
> >
> > For short jobs, this can be a bit inaccurate, but those jobs are short :)
> >
> > For longer jobs, the time-per-work-item becomes increasingly accurate so
> the
> > RM prediction tends
> > to improve and ramp-up WILL occur if the work-item time turns out to be
> > larger than originally
> > thought.  (Our experience is that work-item times are mostly uniform with
> > occasional outliers, but
> > the prediction seems to work well).
> >
> > Relevant configuration parameters in ducc.properties:
> > # Predict when a job will end and avoid expanding if not needed. Set to
> > false to disable prediction.
> >ducc.rm.prediction = true
> > # Add this fudge factor (milliseconds) to the expansion target when using
> > prediction
> >ducc.rm.prediction.fudge = 12
> >
> > You can observe this in the rm log, see the example below.  I'm
> preparing a
> > guide to this log; for now,
> > the net of these two log lines is: the projection for the job in question
> > (job 208927) is that 16 processes
> > are needed to complete this job, even though the job could use 20
> processes
> > at full expansion - the BaseCap -
> > so a max of 16 will be scheduled for it,  subject to fair-share
> constraint.
> >
> > 17 Nov 2014 15:07:38,880  INFO RM.RmJob - */getPrjCap/* 208927  bobuser
> O 2
> > T 343171 NTh 128 TI 143171 TR 6748.601431980907 R 1.8967e-02 QR 5043 P
> 6509
> > F 0 ST 1416254363603*/return 16/*
> > 17 Nov 2014 15:07:38,880  INFO RM.RmJob - */initJobCap/* 208927 bobuser
> O 2
> > */Base cap:/* 20 Expected future cap: 16 potential cap 16 actual cap 16
> >
> > Jim
> >
> >
> > On 11/17/14, 3:44 PM, Eddie Epstein wrote:
> >>
> >> DuccRawTextSpec.job specifies that each job process (JP)
> >> run 8 analytic pipeline threads. So for this job with 100 work
> >> items, no more than 13 JPs would ever be started.
> >>
> >> After successful initialization of the first JP, DUCC begins scaling
> >> up the number of JPs using doubling. During JP scale up the
> >> scheduler monitors the work item completion rate, compares that
> >> w

Re: Ducc: Rename failed

2014-11-28 Thread Eddie Epstein
To debug, please add the following option to the job submission:
--all_in_one local

This will run all the code in a single process on the machine doing the
submit. Hopefully the log file and/or console will be more informative.
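For example, with the sample job specification that would look roughly like:

  MyAppDir=$PWD MyInputDir=$PWD/txt MyOutputDir=$PWD/txt.processed \
    ~/ducc_install/bin/ducc_submit -f DuccRawTextSpec.job --all_in_one local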

On Fri, Nov 28, 2014 at 1:41 PM, Simon Hafner  wrote:

> 2014-11-28 10:45 GMT-06:00 Eddie Epstein :
> > DuccCasCC component has presumably created
> > /home/ducc/analysis/txt.processed/5911.txt_0_processed.zip_temp and
> written
> > to it?
> I don't know, the _temp file doesn't exist anymore.
>
> > Did you run this sample job in something other than cluster mode?
> I get the same error running on a single machine.
>


Re: Ducc: Rename failed

2014-11-28 Thread Eddie Epstein
DuccCasCC component has presumably created
/home/ducc/analysis/txt.processed/5911.txt_0_processed.zip_temp and written
to it?

Did you run this sample job in something other than cluster mode?




On Fri, Nov 28, 2014 at 10:23 AM, Simon Hafner 
wrote:

> When running DUCC in cluster mode, I get "Rename failed". The file
> mentioned in the error message exists in the txt.processed/ directory.
> The mount is via nfs (rw,sync,insecure).
>
> org.apache.uima.resource.ResourceProcessException: Received Exception
> In Message From Service on Queue:ducc.jd.queue.75 Broker:
> tcp://10.0.0.164:61617?jms.useCompression=true Cas
> Identifier:18acd63:149f6f562d3:-7fa6 Exception:{3}
> at
> org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl.sendAndReceiveCAS(BaseUIMAAsynchronousEngineCommon_impl.java:2230)
> at
> org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl.sendAndReceiveCAS(BaseUIMAAsynchronousEngineCommon_impl.java:2049)
> at org.apache.uima.ducc.jd.client.WorkItem.run(WorkItem.java:145)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.uima.aae.error.UimaEEServiceException:
> org.apache.uima.analysis_engine.AnalysisEngineProcessException
> at
> org.apache.uima.adapter.jms.activemq.JmsOutputChannel.sendReply(JmsOutputChannel.java:932)
> at
> org.apache.uima.aae.controller.BaseAnalysisEngineController.handleAction(BaseAnalysisEngineController.java:1172)
> at
> org.apache.uima.aae.controller.PrimitiveAnalysisEngineController_impl.takeAction(PrimitiveAnalysisEngineController_impl.java:1145)
> at
> org.apache.uima.aae.error.handler.ProcessCasErrorHandler.handleError(ProcessCasErrorHandler.java:405)
> at
> org.apache.uima.aae.error.ErrorHandlerChain.handle(ErrorHandlerChain.java:57)
> at
> org.apache.uima.aae.controller.PrimitiveAnalysisEngineController_impl.process(PrimitiveAnalysisEngineController_impl.java:1065)
> at
> org.apache.uima.aae.handler.HandlerBase.invokeProcess(HandlerBase.java:121)
> at
> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:543)
> at
> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.handle(ProcessRequestHandler_impl.java:1050)
> at
> org.apache.uima.aae.handler.input.MetadataRequestHandler_impl.handle(MetadataRequestHandler_impl.java:78)
> at
> org.apache.uima.adapter.jms.activemq.JmsInputChannel.onMessage(JmsInputChannel.java:728)
> at
> org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:535)
> at
> org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:495)
> at
> org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:467)
> at
> org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:325)
> at
> org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:263)
> at
> org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1058)
> at
> org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:952)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at
> org.apache.uima.aae.UimaAsThreadFactory$1.run(UimaAsThreadFactory.java:129)
> ... 1 more
> Caused by: org.apache.uima.analysis_engine.AnalysisEngineProcessException
> at
> org.apache.uima.ducc.sampleapps.DuccCasCC.process(DuccCasCC.java:117)
> at
> org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
> at
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)
> at
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:309)
> at
> org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:569)
> at
> org.apache.uima.a

Re: DUCC org.apache.uima.util.InvalidXMLException and no logs

2014-11-27 Thread Eddie Epstein
Those are the only two log files? There should be a ducc.log (probably with no
more info than on the console), and either one or both of the job driver
log files: jd.out.log and jobid-JD-jdnode-jdpid.log. If for some reason the
job driver failed to start, check the job driver agent log (the agent
managing the System/JobDriver reservation) for more info on what happened.
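Based on the specification you posted, the job's log directory should be
/home/ducc/analysis/logs/50, so something like the following will show
whether the job driver wrote anything there (exact file names can vary a bit
between DUCC versions):

  ls -l /home/ducc/analysis/logs/50
  # look for jd.out.log and 50-JD-<jdnode>-<jdpid>.log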

On Wed, Nov 26, 2014 at 10:06 PM, Simon Hafner 
wrote:

> When launching the Raw Text example application, it doesn't load with
> the following error:
>
> [ducc@ip-10-0-0-164 analysis]$ MyAppDir=$PWD MyInputDir=$PWD/txt
> MyOutputDir=$PWD/txt.processed ~/ducc_install/bin/ducc_submit -f
> DuccRawTextSpec.job
> Job 50 submitted
> id:50 location:5991@ip-10-0-0-164
> id:50 state:WaitingForDriver
> id:50 state:Completing total:-1 done:0 error:0 retry:0 procs:0
> id:50 state:Completed total:-1 done:0 error:0 retry:0 procs:0
> id:50 rationale:job driver exception occurred:
> org.apache.uima.util.InvalidXMLException at
>
> org.apache.uima.ducc.common.uima.UimaUtils.getXMLInputSource(UimaUtils.java:246)
>
> However, there are no logs with a stacktrace or similar, how do I get
> hold of one? The only files in the log directory are:
>
> [ducc@ip-10-0-0-164 analysis]$ cat logs/50/specified-by-user.properties
> #Thu Nov 27 03:00:57 UTC 2014
> working_directory=/home/ducc/analysis
> process_descriptor_CM=org.apache.uima.ducc.sampleapps.DuccTextCM
> driver_descriptor_CR=org.apache.uima.ducc.sampleapps.DuccJobTextCR
> cancel_on_interrupt=
> process_descriptor_CC_overrides=UseBinaryCompression\=true
> process_descriptor_CC=org.apache.uima.ducc.sampleapps.DuccCasCC
> log_directory=/home/ducc/analysis/logs
> wait_for_completion=
> classpath=/home/ducc/analysis/lib/*
> process_thread_count=8
> driver_descriptor_CR_overrides=BlockSize\=10 SendToLast\=true
> InputDirectory\=/home/ducc/analysis/txt
> OutputDirectory\=/home/ducc/analysis/txt.processed
> process_per_item_time_max=20
>
> process_descriptor_AE=/home/ducc/analysis/opennlp.uima.OpenNlpTextAnalyzer/opennlp.uima.OpenNlpTextAnalyzer_pear.xml
> description=DUCC raw text sample application
> process_jvm_args=-Xmx3G -XX\:+UseCompressedOops
>
> -Djava.util.logging.config.file\=/home/ducc/analysis/ConsoleLogger.properties
> scheduling_class=normal
> process_memory_size=4
> specification=DuccRawTextSpec.job
>
> [ducc@ip-10-0-0-164 analysis]$ cat logs/50/job-specification.properties
> #Thu Nov 27 03:00:57 UTC 2014
> working_directory=/home/ducc/analysis
> process_descriptor_CM=org.apache.uima.ducc.sampleapps.DuccTextCM
> process_failures_limit=20
> driver_descriptor_CR=org.apache.uima.ducc.sampleapps.DuccJobTextCR
> cancel_on_interrupt=
> process_descriptor_CC_overrides=UseBinaryCompression\=true
> process_descriptor_CC=org.apache.uima.ducc.sampleapps.DuccCasCC
> classpath_order=ducc-before-user
> log_directory=/home/ducc/analysis/logs
> submitter_pid_at_host=5991@ip-10-0-0-164
> wait_for_completion=
> classpath=/home/ducc/analysis/lib/*
> process_thread_count=8
> driver_descriptor_CR_overrides=BlockSize\=10 SendToLast\=true
> InputDirectory\=/home/ducc/analysis/txt
> OutputDirectory\=/home/ducc/analysis/txt.processed
> process_initialization_failures_cap=99
> process_per_item_time_max=20
>
> process_descriptor_AE=/home/ducc/analysis/opennlp.uima.OpenNlpTextAnalyzer/opennlp.uima.OpenNlpTextAnalyzer_pear.xml
> description=DUCC raw text sample application
> process_jvm_args=-Xmx3G -XX\:+UseCompressedOops
>
> -Djava.util.logging.config.file\=/home/ducc/analysis/ConsoleLogger.properties
> scheduling_class=normal
> environment=HOME\=/home/ducc LANG\=en_US.UTF-8 USER\=ducc
> process_memory_size=4
> user=ducc
> specification=DuccRawTextSpec.job
>


Re: DUCC web server interfacing

2014-11-21 Thread Eddie Epstein
On Thu, Nov 20, 2014 at 10:01 PM, D. Heinze  wrote:

> Eddie... thanks.  Yes, that sounds like I would not have the advantage of
> DUCC managing the UIMA pipeline.
>

Depends on the definition of "managing". DUCC manages the lifecycle of
analytic pipelines running as job processes and as services. There are
differences in how DUCC decides how many instances of each are run. And you
are right that only for jobs will DUCC send work items to the analytic
pipeline.


>
> To break it down a little for the uninitiated (me),
>
>  1. how do I start a DUCC job that stays resident because it has high
> startup cost (e.g. 2 minutes to load all the resources for the UIMA
> pipeline VS about 2 seconds to process each request)?
>

Run the pipeline as a service. A service can be configured to start
automatically, as soon as DUCC starts. If the load on the service
increases, DUCC can be told [manually or programmatically] to launch
additional service instances.
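A very rough sketch of what that can look like; the property and option
names below are illustrative and should be checked against the ducc_services
section of the duccbook for your release:

  # myservice.properties (illustrative values)
  description=Resident UIMA-AS analytic with expensive initialization
  process_DD=/home/ducc/analysis/MyAnalyticDeploymentDescriptor.xml
  process_memory_size=4
  process_jvm_args=-Xmx3G
  scheduling_class=fixed

  # register the service and let DUCC start it automatically,
  # then add an instance later if the load grows
  $DUCC_HOME/bin/ducc_services --register myservice.properties --autostart true
  $DUCC_HOME/bin/ducc_services --start <service-id> --instances 1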


> 2. once I have a resident job, how do I get the Job Driver to iteratively
> feed references to each next document (as they are received) to the
> resident Job Process?  Because all the input jobs will be archived anyhow,
> I'm okay with passing them through the file system if needed.
>

The easiest approach is to have an application driver, say a web service,
directly feed input to the service. If using references as input, the same
analytic pipeline could be used both for live processing as a service and
for batch job processing.
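As a rough illustration only, a driver could push document references to a
UIMA-AS service with the standard UIMA-AS client API along the lines below;
the broker URL, queue name and the way the reference is carried in the CAS
are placeholders, not part of any DUCC sample:

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.uima.aae.client.UimaAsynchronousEngine;
  import org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngine_impl;
  import org.apache.uima.cas.CAS;

  public class ServiceDriver {
    public static void main(String[] args) throws Exception {
      UimaAsynchronousEngine client = new BaseUIMAAsynchronousEngine_impl();

      // Broker and queue are placeholders; use whatever the deployed
      // UIMA-AS service listens on.
      Map<String, Object> ctx = new HashMap<String, Object>();
      ctx.put(UimaAsynchronousEngine.ServerUri, "tcp://broker-host:61617");
      ctx.put(UimaAsynchronousEngine.ENDPOINT, "myAnalyticQueue");
      ctx.put(UimaAsynchronousEngine.CasPoolSize, 8);
      client.initialize(ctx);

      // One CAS per incoming document reference; here the reference is
      // simply placed in the sofa, but it could be any agreed-on feature.
      for (String docReference : args) {
        CAS cas = client.getCAS();
        cas.setDocumentText(docReference);
        client.sendAndReceiveCAS(cas); // synchronous; use sendCAS + a listener for async
        cas.release();
      }
      client.stop();
    }
  }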

DUCC jobs are designed for batch work, where the size of the input
collection is known and the number of job processes will be replicated as
much as possible, given available resources and the job's fair share when
multiple jobs are running.

DUCC services are intended to support job pipelines, for example a large-memory
but low-latency analytic that can be shared by many job process instances, or
to support interactive applications.

Have you looked at creating a UIMA-AS service from a UIMA pipeline?

Eddie


Re: DUCC web server interfacing

2014-11-20 Thread Eddie Epstein
Ooops, in this case the web server would be feeding the service directly.

On Thu, Nov 20, 2014 at 9:04 PM, Eddie Epstein  wrote:

> The preferred approach is to run the analytics as a DUCC service, and have
> an application driver that feeds the service instances with incoming data.
> This service would be a scalable UIMA-AS service, which could have as
> many instances as are needed to keep up with the load. The driver would
> use the uima-as client API to feed the service. The application driver
> could itself be another DUCC service.
>
> DUCC manages the life cycle of its services, including restarting them on
> failure.
>
> Eddie
>
>
> On Thu, Nov 20, 2014 at 6:45 PM, Daniel Heinze  wrote:
>
>> I just installed DUCC this week and can process batch jobs.  I would like
>> DUCC to initiate/manage one or more copies of the same UIMA pipeline that
>> has high startup overhead and keep it/them active and feed it/them with
>> documents that arrive periodically over a web service.  Any suggestions on
>> the preferred way (if any) to do this in DUCC.
>>
>>
>>
>> Thanks / Dan
>>
>>
>


Re: DUCC web server interfacing

2014-11-20 Thread Eddie Epstein
The preferred approach is to run the analytics as a DUCC service, and have
an application driver that feeds the service instances with incoming data.
This service would be a scalable UIMA-AS service, which could have as
many instances as are needed to keep up with the load. The driver would
use the uima-as client API to feed the service. The application driver
could itself be another DUCC service.

DUCC manages the life cycle of its services, including restarting them on
failure.

Eddie


On Thu, Nov 20, 2014 at 6:45 PM, Daniel Heinze  wrote:

> I just installed DUCC this week and can process batch jobs.  I would like
> DUCC to initiate/manage one or more copies of the same UIMA pipeline that
> has high startup overhead and keep it/them active and feed it/them with
> documents that arrive periodically over a web service.  Any suggestions on
> the preferred way (if any) to do this in DUCC.
>
>
>
> Thanks / Dan
>
>


Re: DUCC-Un-managed Reservation??

2014-11-18 Thread Eddie Epstein
On Tue, Nov 18, 2014 at 1:05 AM, reshu.agarwal 
wrote:

>
> Hi,
>
> I am a bit confused. Why do we need an un-managed reservation? Suppose we
> give 5GB of memory to this reservation. Can this RAM be consumed by any
> process if required?
>

Basically yes. See more info about "Rogue Process" in the duccbook.


>
> In my scenario, when all the RAM on the nodes was consumed by jobs, all
> processes went into a waiting state. I need to reserve some RAM so that it
> cannot be consumed by job process shares, but could still be used internally
> if required.
>

Any idea why all processes went into a waiting state? Did the job details page
show that these JPs were assigned work items?


>
> Can un-managed reservation be used for this?
>
> Thanks in advanced.
>
> Reshu.
>
>
>


Re: DUCC doesn't use all available machines

2014-11-17 Thread Eddie Epstein
DuccRawTextSpec.job specifies that each job process (JP)
run 8 analytic pipeline threads. So for this job with 100 work
items, no more than 13 JPs would ever be started.

After successful initialization of the first JP, DUCC begins scaling
up the number of JPs using doubling. During JP scale up the
scheduler monitors the work item completion rate, compares that
with the JP initialization time, and stops scaling up JPs when
starting more JPs will not make the job run any faster.

Of course JP scale up is also limited by the job's "fair share"
of resources relative to total resources available for all preemptable jobs.

To see more JPs, increase the number and/or size of the input text files,
or decrease the number of pipeline threads per JP.
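For example, with the sample job specification used here, dropping the
per-process thread count spreads the same 100 work items over more JPs:

  # in DuccRawTextSpec.job
  process_thread_count=4
  # 100 work items / 4 threads per JP -> up to 25 JPs instead of 13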

Note that it can be counterproductive to run "too many" pipeline
threads per machine. Assuming analytic threads are 100% CPU bound,
running more threads than real cores will often slow down the overall
document processing rate.


On Mon, Nov 17, 2014 at 6:48 AM, Simon Hafner  wrote:

> I fired the DuccRawTextSpec.job on a cluster consisting of three
> machines, with 100 documents. The scheduler only runs the processes on
> two machines instead of all three. Can I mess with a few config
> variables to make it use all three?
>
> id:22 state:Running total:100 done:0 error:0 retry:0 procs:1
> id:22 state:Running total:100 done:0 error:0 retry:0 procs:2
> id:22 state:Running total:100 done:0 error:0 retry:0 procs:4
> id:22 state:Running total:100 done:1 error:0 retry:0 procs:8
> id:22 state:Running total:100 done:6 error:0 retry:0 procs:8
>


Re: DUCC stuck at WaitingForResources on an Amazon Linux

2014-11-15 Thread Eddie Epstein
On Fri, Nov 14, 2014 at 8:11 PM, Simon Hafner  wrote:

> So to run effectively, I would need more memory, because the job wants
> two shares? ... Yes. With a larger node it works. What would be a
> reasonable memory size for a ducc node?
>

Really depends on the application code. Quoting from the DUCC overview at
http://uima.apache.org/doc-uimaducc-whatitam.html

  "DUCC is particularly well suited to run large memory Java analytics in
   multiple threads in order to fully utilize multicore machines."

Our experience to date has been with machines 16-256GB and 4-32 CPU cores.
Smaller machines, 8GB or less, have only been used for development of DUCC
itself, with dummy analytics that use little memory and CPU.

