Re: Anybody using UIMA DUCC? Care to give a hand?

2022-11-11 Thread Eddie Epstein
Hi Richard,

Our last DUCC cluster was retired earlier this year. I would vote for
retirement.

Regards,
Eddie

On Fri, Nov 11, 2022 at 10:38 AM Richard Eckart de Castilho 
wrote:

> Hi all,
>
> is anybody using UIMA DUCC?
>
> If yes, it would be great if you could lend us a hand in preparing a new
> release.
>
> If nobody steps up, then I will suggest to retire UIMA DUCC towards the
> end of Nov 2022 (in about two weeks).
>
> Cheers,
>
> -- Richard (with the Apache UIMA PMC Chair hat on)
>
>


Re: Recover or invalidate Collection Reader CAS

2022-08-26 Thread Eddie Epstein
The JMS service descriptor defines timeout:
https://uima.apache.org/d/uima-as-v2-current/uima_async_scaleout.html#ugr.async.ov.concepts.jms_descriptor

Error configuration:
https://uima.apache.org/d/uima-as-v2-current/uima_async_scaleout.html#ugr.ref.async.deploy.descriptor.errorconfig
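For orientation, those settings live in the UIMA-AS deployment descriptor roughly as
sketched below (queue, broker and descriptor values are placeholders; the exact element
and attribute names should be checked against the reference linked above):

  <analysisEngineDeploymentDescription xmlns="http://uima.apache.org/resourceSpecifier">
    <deployment protocol="jms" provider="activemq">
      <service>
        <inputQueue endpoint="TopLevelQueue" brokerURL="tcp://localhost:61616"/>
        <topDescriptor>
          <import location="MyAggregate.xml"/>
        </topDescriptor>
        <analysisEngine>
          <delegates>
            <remoteAnalysisEngine key="RemoteAE">
              <inputQueue endpoint="RemoteAEQueue" brokerURL="tcp://broker:61616"/>
              <asyncAggregateErrorConfiguration>
                <!-- timeouts are in milliseconds; 0 (the default) waits forever -->
                <getMetadataErrors maxRetries="2" timeout="30000" errorAction="terminate"/>
                <processCasErrors maxRetries="1" timeout="60000"
                                  continueOnRetryFailure="true"
                                  thresholdCount="5" thresholdWindow="20"
                                  thresholdAction="disable"/>
              </asyncAggregateErrorConfiguration>
            </remoteAnalysisEngine>
          </delegates>
        </analysisEngine>
      </service>
    </deployment>
  </analysisEngineDeploymentDescription>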

Eddie

On Fri, Aug 26, 2022 at 10:01 AM Daniel Cosio  wrote:

> Any chance you could point me to where this is defined in the docs?
> Daniel Cosio
> dcco...@gmail.com
>
>
>
> > On Aug 26, 2022, at 8:52 AM, Eddie Epstein  wrote:
> >
> > UIMA-AS supports timeouts for remote annotators; the default timeout is
> > infinite. On timeout uima-as will take the action specified by error
> > handling configuration, but in any case the CAS sent to the remote will
> be
> > available for reuse.
> >
> > Eddie
> >
> > On Thu, Aug 25, 2022 at 12:05 PM Timo Boehme 
> > wrote:
> >
> >> PS: if the OS is killing the process because of a lack of memory (the
> >> typical case), it means the Java VM is allowed to use more (heap) memory
> >> than is available on this node. Consider adjusting the memory setting for
> >> the Java process to prevent the OS kill. Then you may get an
> >> OutOfMemoryError instead, which is bad too, but the Java VM might be able
> >> to do some cleanup/shutdown etc.
> >>
> >>
> >> Timo Boehme
> >>
> >>
> >>
> >> Am 25.08.22 um 17:55 schrieb Timo Boehme:
> >>> Hi Daniel,
> >>>
> >>> I am not using UIMA-AS myself, but if the OS is killing a process because
> >>> of a lack of resources it normally does so with a hard kill, which does
> >>> not allow the Java VM process to do any shutdown work.
> >>> One would need a separate process controlling the Java one, reacting when
> >>> the Java VM is killed - however, this won't help in getting the CAS
> >>> released (unless the controlling process has specific UIMA knowledge). I
> >>> don't know whether UIMA-AS uses such a 2-process model per node, but I
> >>> assume it does not.
> >>>
> >>>
> >>> Regards,
> >>> Timo Boehme
> >>>
> >>>
> >>> Am 25.08.22 um 17:28 schrieb Daniel Cosio:
> >>>> Yes, this is uima-as-jms. The pipeline gets a signal from the OS and
> >>>> shuts down, but we lose the CAS. Is there any API I can use to tell
> >>>> the collection reader to invalidate it? I know the AE has
> >>>> a temp queue connection that communicates the CAS releases. I was
> >>>> wondering if there was any way of getting the temp queue connection and
> >>>> sending the message back to return the CAS, possibly in a shutdown
> >> hook.
> >>>>
> >>>>
> >>>> Daniel Cosio
> >>>> dcco...@gmail.com
> >>>>
> >>>>
> >>>>
> >>>>> On Aug 25, 2022, at 9:20 AM, Eddie Epstein 
> >> wrote:
> >>>>>
> >>>>> Daniel, is this again a uima-as deployment? If so, since the OS kills
> >>>>> processes, is it some remote AE being killed?
> >>>>>
> >>>>> Eddie
> >>>>>
> >>>>> On Wed, Aug 24, 2022 at 10:04 AM Daniel Cosio 
> >> wrote:
> >>>>>
> >>>>>> Hi, I have some instances where the OS has killed a pipeline to
> >> recover
> >>>>>> resources. When this happens the pipeline never returns the CAS to
> >> the
> >>>>>> reader, so the reader now has one less CAS
> >>>>>> available. Is there a way to either
> >>>>>> 1. Add a shutdown hook on the pipeline to return the CAS if it gets
> a
> >>>>>> shutdown signal
> >>>>>> or
> >>>>>> 2. Set an expiration on the collection reader to expire a CAS that
> >>>>>> is not
> >>>>>> returned and issue a new one into the CAS pool
> >>>>>>
> >>>>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>>
> >>>>>> Daniel Cosio
> >>>>>> dcco...@gmail.com
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >> --
> >> OntoChem GmbH
> >> Blücherstraße 24
> >> 06120 Halle (Saale)
> >> Germany
> >>
> >> email: timo.boe...@ontochem.com | web: www.ontochem.com
> >> HRB 215461 Amtsgericht Stendal  | USt-IdNr.: DE246232735
> >> managing directors: Dr. Lutz Weber (CEO), Dr. Felix Berthelmann (COO)
> >>
> >>
>
>


Re: Recover or invalidate Collection Reader CAS

2022-08-26 Thread Eddie Epstein
UIMA-AS supports timeouts for remote annotators; the default timeout is
infinite. On timeout uima-as will take the action specified by error
handling configuration, but in any case the CAS sent to the remote will be
available for reuse.

Eddie

On Thu, Aug 25, 2022 at 12:05 PM Timo Boehme 
wrote:

> PS: if the OS is killing the process because of a lack of memory (the
> typical case), it means the Java VM is allowed to use more (heap) memory
> than is available on this node. Consider adjusting the memory setting for
> the Java process to prevent the OS kill. Then you may get an
> OutOfMemoryError instead, which is bad too, but the Java VM might be able
> to do some cleanup/shutdown etc.
>
>
> Timo Boehme
>
>
>
> Am 25.08.22 um 17:55 schrieb Timo Boehme:
> > Hi Daniel,
> >
> > I am not using UIMA-AS myself, but if the OS is killing a process because
> > of a lack of resources it normally does so with a hard kill, which does
> > not allow the Java VM process to do any shutdown work.
> > One would need a separate process controlling the Java one, reacting when
> > the Java VM is killed - however, this won't help in getting the CAS
> > released (unless the controlling process has specific UIMA knowledge). I
> > don't know whether UIMA-AS uses such a 2-process model per node, but I
> > assume it does not.
> >
> >
> > Regards,
> > Timo Boehme
> >
> >
> > Am 25.08.22 um 17:28 schrieb Daniel Cosio:
> >> Yes, this is uima-as-jms. The pipeline gets a signal from the OS and
> >> shuts down, but we lose the CAS. Is there any API I can use to tell
> >> the collection reader to invalidate it? I know the AE has
> >> a temp queue connection that communicates the CAS releases. I was
> >> wondering if there was any way of getting the temp queue connection and
> >> sending the message back to return the CAS, possibly in a shutdown
> hook.
> >>
> >>
> >> Daniel Cosio
> >> dcco...@gmail.com
> >>
> >>
> >>
> >>> On Aug 25, 2022, at 9:20 AM, Eddie Epstein 
> wrote:
> >>>
> >>> Daniel, is this again a uima-as deployment? If so, since the OS kills
> >>> processes, is it some remote AE being killed?
> >>>
> >>> Eddie
> >>>
> >>> On Wed, Aug 24, 2022 at 10:04 AM Daniel Cosio 
> wrote:
> >>>
> >>>> Hi, I have some instances where the OS has killed a pipeline to
> recover
> >>>> resources. When this happens the pipeline never returns the CAS to
> the
> >>>> reader, so the reader now has one less CAS
> >>>> available. Is there a way to either
> >>>> 1. Add a shutdown hook on the pipeline to return the CAS if it gets a
> >>>> shutdown signal
> >>>> or
> >>>> 2. Set an expiration on the collection reader to expire a CAS that
> >>>> is not
> >>>> returned and issue a new one into the CAS pool
> >>>>
> >>>>
> >>>> Thanks
> >>>>
> >>>>
> >>>> Daniel Cosio
> >>>> dcco...@gmail.com
> >>>>
> >>>>
> >>>>
> >>>>
> >>
> >
> >
>
>
> --
> OntoChem GmbH
> Blücherstraße 24
> 06120 Halle (Saale)
> Germany
>
> email: timo.boe...@ontochem.com | web: www.ontochem.com
> HRB 215461 Amtsgericht Stendal  | USt-IdNr.: DE246232735
> managing directors: Dr. Lutz Weber (CEO), Dr. Felix Berthelmann (COO)
>
>


Re: Towards a (new) UIMA CAS JSON format - feedback welcome!

2021-08-26 Thread Eddie Epstein
Richard,
Looks promising! I put a few comments in the drive document.
Regards, Eddie

On Fri, Aug 20, 2021 at 5:27 AM Richard Eckart de Castilho 
wrote:

> Hi all,
>
> to facilitate working with UIMA CAS data and to promote interoperability
> between different UIMA implementations, a new UIMA JSON CAS format is in
> the works - you may already have noticed a corresponding issue in Jira
> as well as prototype pull requests in the Apache UIMA Java SDK as well
> as in DKPro Cassis. However, the work so far was only preliminary with
> the goal of creating a reasonably comprehensive specification draft
> that would be suitable for general comments.
>
> That draft is now available here and the comment functionality is enabled:
>
>
> https://docs.google.com/document/d/1tHQKbN4rPKOlkjQFGIEoIzI4ZBWRRzPBdOWMK6MIpV8/edit?usp=sharing
>
> If you think a JSON format for the UIMA CAS is a good idea, please have
> a look and provide any comments directly in the document or alternatively
> here to the list.
>
> The two prototype implementations can be found here if you would like
> to play around with them:
>
> * Apache UIMA Java SDK : https://github.com/apache/uima-uimaj/pull/137
> * DKPro Cassis (Python): https://github.com/dkpro/dkpro-cassis/pull/169
>
> Note that the prototype implementations largely but not fully follow the
> latest specification draft - this is all early work-in-progress.
>
> Looking forward to your comments!
>
> Best,
>
> -- Richard
>
> P.S.: this mail is cross-posted to the UIMA developers list. Please
>   send any replies to the users list.


Re: UIMA DUCC slow processing

2020-06-15 Thread Eddie Epstein
The time sequence of a DUCC job is as follows:
1. The JobDriver is started and the CR.init method called
2. When CR.init completes successfully one or more JobProcesses are started
and the Aggregate pipeline init method in each called
3. If the first pipeline init to complete is successful the DUCC job status
changes to RUNNING

The Processes tab on the job details page shows the init times for the JD
(JobDriver) and each of the JobProcesses. The ducc.log file on the Files
tab gives timestamps for job state changes.

Reported initialization times correspond to the init() method calls of the
UIMA components. Is the initialization delay in the CR init, or the
JobProcess init? Anything interesting in the logfiles for those components?

Normally the number of tasks should match the number of workitems. These
can be quite different if the JobProcess is using a custom UIMA-AS
asynchronous threading model. What do you see on the Work Items tab?

For debugging, DUCC's --all_in_one option allows running all the components,
CR + CM + AE + CC, in a single thread in the same process. I'd suggest that
for the CasConsumer issue. If that works, and if you are running multiple
pipelines, then there is likely a thread-safety issue involving the
Elasticsearch API.
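In job-specification terms, such an all-in-one debugging run might look like the sketch
below (descriptor names, classpath and memory size are placeholders; see the DuccBook for
the full list of keys):

  # run CR + CM + AE + CC in a single thread in one local process
  description            = Debug sentiment job, single threaded
  driver_descriptor_CR   = org.myorg.EsIdCollectionReader
  process_descriptor_CM  = org.myorg.EsDocCasMultiplier
  process_descriptor_AE  = org.myorg.SentimentAE
  process_descriptor_CC  = org.myorg.EsCasConsumer
  process_memory_size    = 4
  classpath              = lib/*
  all_in_one             = local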

Eddie

On Mon, Jun 15, 2020 at 1:30 AM Dr. Raja M. Suleman <
raja.m.sulai...@gmail.com> wrote:

> Thank you very much for your response.
>
> Actually I am working on a project that would require horizontal scaling
> therefore I am focused on DUCC at the moment. My original query started
> with my question regarding a job I had created which was giving me a low
> throughput. The pipeline for this job looks like this:
>
>1.  A CollectionReader connects to an Elasticsearch server and reads ids
>from an index and adds *1* id in each workitem which is then passed to
>    the CasMultiplier.
>2. The CASMultiplier uses the 'id' in each workitem to get the 'document
>text' from the Elasticsearch index. Each document text is a short
> review (1
>- 20 lines) of English. In the Abstract 'next()' method I create an
> empty
>JCas object and add the document text and other details related to the
>review to the DocumentInfo(newcas) and return the JCas object.
>3. My AnalysisEngine is running sentiment analysis on the document text.
>    Sentiment analysis is a computationally expensive operation, especially
> for
>longer reviews.
>    4. Finally my CasConsumer is writing each DocumentInfo object into an
>Elasticsearch index.
>
>
> A few things I noticed running this jobs and would be grateful for your
> comments on them:
>
>1. The job's initialization time increases with the number of documents
>in the index exponentially. I'm using the Elasticsearch scroll API which
>returns all the document ids within milliseconds. However, the DUCC job
>takes a long time to start running (~35 minutes for 100k documents).
> I've
>noticed that the initialization time for the DUCC job increases
>exponentially with the number of records. Is this due to the new CASes
>    being generated for each record in the CollectionReader?
>2.  While checking the Performance tab of a job in the webserver UI, I
>noticed that under the "Tasks" column, the number of Tasks for all the
>components except the AnalysisEngine (AE) is twice the number of
> documents
>processed, e.g. if the job has processed 100 documents, it will show 200
>tasks for all components and 100 for the AE component.
>3. In the CasConsumer, I tried to use the BulkProcessor provided by the
>Elasticsearch Java API, which works asynchronously to send bulk indexing
>requests. However, asynchronous calls weren't registering and the
>CasConsumer would return without writing anything in the Elasticsearch
>index. I checked the job logs and couldn't find any error messages.
>
> I'm sorry for another long message and I truly am grateful to you for your
> kind guidance.
>
> Thank you very much.
>
> On Mon, 15 Jun 2020, 00:34 Eddie Epstein,  wrote:
>
> > I forgot to add, if your application does not require horizontal scale
> out
> > to many CPUs on multiple machines, UIMA has a vertical scale out tool,
> the
> > CPE, that can support running multiple pipeline threads on a single
> > machine.
> > More information is at
> >
> >
> http://uima.apache.org/d/uimaj-current/tutorials_and_users_guides.html#ugr.tug.cpe
> >
> >
> >
> >
> > On Sun, Jun 14, 2020 at 7:06 PM Eddie Epstein 
> wrote:
> >
> > > In this case the problem is not DUCC, rather it is the high overhead of
> > > opening small files and sending them to a remote computer individually.
> > I/O
> > > wor

Re: UIMA DUCC slow processing

2020-06-14 Thread Eddie Epstein
I forgot to add, if your application does not require horizontal scale out
to many CPUs on multiple machines, UIMA has a vertical scale out tool, the
CPE, that can support running multiple pipeline threads on a single
machine.
More information is at
http://uima.apache.org/d/uimaj-current/tutorials_and_users_guides.html#ugr.tug.cpe
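The thread count itself is set in the CPE descriptor's <casProcessors> element; a
fragment-only sketch (descriptor path and numbers are placeholders, and the other
required casProcessor children are omitted here):

  <casProcessors casPoolSize="9" processingUnitThreadCount="8">
    <!-- processingUnitThreadCount = pipeline threads on this machine;
         casPoolSize is typically the thread count plus a small margin -->
    <casProcessor deployment="integrated" name="MyAnalysisEngine">
      <descriptor>
        <include href="desc/MyAnalysisEngine.xml"/>
      </descriptor>
    </casProcessor>
  </casProcessors>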




On Sun, Jun 14, 2020 at 7:06 PM Eddie Epstein  wrote:

> In this case the problem is not DUCC, rather it is the high overhead of
> opening small files and sending them to a remote computer individually. I/O
> works much more efficiently with larger blocks of data. Many small files
> can be merged into larger files using zip archives. DUCC sample code shows
> how to do this for CASes, and very similar code could be used for input
> documents as well.
>
> Implementing efficient scale out is highly dependent on good treatment of
> input and output data.
> Best,
> Eddie
>
>
> On Sat, Jun 13, 2020 at 6:24 AM Dr. Raja M. Suleman <
> raja.m.sulai...@gmail.com> wrote:
>
>> Hello,
>>
>> Thank you very much for your response and even more so for the detailed
>> explanation.
>>
>> So, if I understand it correctly, DUCC is more suited for scenarios where
>> we have large input documents rather than many small ones?
>>
>> Thank you once again.
>>
>> On Fri, 12 Jun 2020, 22:18 Eddie Epstein,  wrote:
>>
>> > Hi,
>> >
>> > In this simple scenario there is a CollectionReader running in a
>> JobDriver
>> > process, delivering 100K workitems to multiple remote JobProcesses. The
>> > processing time is essentially zero.  (30 * 60 seconds) / 100,000
>> workitems
>> > = 18 milliseconds per workitem. This time is roughly the expected
>> overhead
>> > of a DUCC jobDriver delivering workitems to remote JobProcesses and
>> > recording the results. DUCC jobs are much more efficient if the overhead
>> > per workitem is much smaller than the processing time.
>> >
>> > Typically DUCC jobs would be processing much larger blocks of content
>> per
>> > workitem. For example, if a workitem was a document, and the document
>> > parsed into the small CASes by the CasMultiplier, the throughput would
>> be
>> > much better. However, with this example, as the number of working
>> > JobProcess threads is scaled up, the CR (JobDriver) would become a
>> > bottleneck. That's why a typical DUCC Job will not send the Document
>> > content as a workitem, but rather send a reference to the workitem
>> content
>> > and have the CasMultipliers in the JobProcesses read the content
>> directly
>> > from the source.
>> >
>> > Even though content read by the JobProcesses is much more efficient, as
>> > scaleout continued to increase for this non-computation scenario the
>> > bottleneck would eventually move to the underlying filesystem or
>> whatever
>> > document source and JobProcess output are. The main motivation for DUCC
>> was
>> > jobs similar to those in the DUCC examples which use OpenNLP to process
>> > large documents. That is, jobs where CPU processing is the bottleneck
>> > rather than I/O.
>> >
>> > Hopefully this helps. If not, happy to continue the discussion.
>> > Eddie
>> >
>> > On Fri, Jun 12, 2020 at 1:16 PM Dr. Raja M. Suleman <
>> > raja.m.sulai...@gmail.com> wrote:
>> >
>> > > Hi,
>> > > Thank you for your reply and I'm sorry I couldn't get back to this
>> > > earlier.
>> > >
>> > > To get a better picture of the processing speed of DUCC, I made a
>> dummy
>> > > pipeline where the CollectionReader runs a for loop to generate 100k
>> > > workitems (so no disk reads). Each workitem only has a simple string
>> in
>> > it.
>> > > These are then passed on to the CasMultiplier where for each workitem
>> I'm
>> > > creating a new CAS with DocumentInfo (again only having a simple
>> string
>> > > value) and pass it as a newcas to the CasConsumer. The CasConsumer
>> > doesn't
>> > > do anything except add the Document received in the CAS to the
>> logger. So
>> > > basically this pipeline isn't doing anything, no Input reads and the
>> only
>> > > output is the information added to the logger. Running this on the
>> > cluster
>> > > with 2 slave nodes with 8-CPUs and 32GB RAM each is still taking more
>> > than
>> > > 30 minutes. I don't understand how is this possible since there's

Re: UIMA DUCC slow processing

2020-06-14 Thread Eddie Epstein
In this case the problem is not DUCC, rather it is the high overhead of
opening small files and sending them to a remote computer individually. I/O
works much more efficiently with larger blocks of data. Many small files
can be merged into larger files using zip archives. DUCC sample code shows
how to do this for CASes, and very similar code could be used for input
documents as well.
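As a rough sketch of that idea in plain Java (not the DUCC sample code itself), a reader
can walk the entries of a single zip archive rather than opening thousands of tiny files:

  import java.io.IOException;
  import java.io.InputStream;
  import java.nio.charset.StandardCharsets;
  import java.util.Enumeration;
  import java.util.zip.ZipEntry;
  import java.util.zip.ZipFile;

  public class ZipDocumentSource {
      // One open/close per archive instead of one per small file.
      public static void readAll(String zipPath) throws IOException {
          try (ZipFile zip = new ZipFile(zipPath)) {
              Enumeration<? extends ZipEntry> entries = zip.entries();
              while (entries.hasMoreElements()) {
                  ZipEntry entry = entries.nextElement();
                  if (entry.isDirectory()) {
                      continue;
                  }
                  try (InputStream in = zip.getInputStream(entry)) {
                      String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                      // hand 'text' to the pipeline, e.g. as the CAS document text
                      System.out.println(entry.getName() + ": " + text.length() + " chars");
                  }
              }
          }
      }
  }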

Implementing efficient scale out is highly dependent on good treatment of
input and output data.
Best,
Eddie


On Sat, Jun 13, 2020 at 6:24 AM Dr. Raja M. Suleman <
raja.m.sulai...@gmail.com> wrote:

> Hello,
>
> Thank you very much for your response and even more so for the detailed
> explanation.
>
> So, if I understand it correctly, DUCC is more suited for scenarios where
> we have large input documents rather than many small ones?
>
> Thank you once again.
>
> On Fri, 12 Jun 2020, 22:18 Eddie Epstein,  wrote:
>
> > Hi,
> >
> > In this simple scenario there is a CollectionReader running in a
> JobDriver
> > process, delivering 100K workitems to multiple remote JobProcesses. The
> > processing time is essentially zero.  (30 * 60 seconds) / 100,000
> workitems
> > = 18 milliseconds per workitem. This time is roughly the expected
> overhead
> > of a DUCC jobDriver delivering workitems to remote JobProcesses and
> > recording the results. DUCC jobs are much more efficient if the overhead
> > per workitem is much smaller than the processing time.
> >
> > Typically DUCC jobs would be processing much larger blocks of content per
> > workitem. For example, if a workitem was a document, and the document
> > parsed into the small CASes by the CasMultiplier, the throughput would be
> > much better. However, with this example, as the number of working
> > JobProcess threads is scaled up, the CR (JobDriver) would become a
> > bottleneck. That's why a typical DUCC Job will not send the Document
> > content as a workitem, but rather send a reference to the workitem
> content
> > and have the CasMultipliers in the JobProcesses read the content directly
> > from the source.
> >
> > Even though content read by the JobProcesses is much more efficient, as
> > scaleout continued to increase for this non-computation scenario the
> > bottleneck would eventually move to the underlying filesystem or whatever
> > document source and JobProcess output are. The main motivation for DUCC
> was
> > jobs similar to those in the DUCC examples which use OpenNLP to process
> > large documents. That is, jobs where CPU processing is the bottleneck
> > rather than I/O.
> >
> > Hopefully this helps. If not, happy to continue the discussion.
> > Eddie
> >
> > On Fri, Jun 12, 2020 at 1:16 PM Dr. Raja M. Suleman <
> > raja.m.sulai...@gmail.com> wrote:
> >
> > > Hi,
> > > Thank you for your reply and I'm sorry I couldn't get back to this
> > > earlier.
> > >
> > > To get a better picture of the processing speed of DUCC, I made a dummy
> > > pipeline where the CollectionReader runs a for loop to generate 100k
> > > workitems (so no disk reads). Each workitem only has a simple string in
> > it.
> > > These are then passed on to the CasMultiplier where for each workitem
> I'm
> > > creating a new CAS with DocumentInfo (again only having a simple string
> > > value) and pass it as a newcas to the CasConsumer. The CasConsumer
> > doesn't
> > > do anything except add the Document received in the CAS to the logger.
> So
> > > basically this pipeline isn't doing anything, no Input reads and the
> only
> > > output is the information added to the logger. Running this on the
> > cluster
> > > with 2 slave nodes with 8-CPUs and 32GB RAM each is still taking more
> > than
> > > 30 minutes. I don't understand how this is possible since no
> > heavy
> > > I/O processing is happening in the code.
> > >
> > > Any ideas please?
> > >
> > > Thank you.
> > >
> > > On 2020/05/18 12:47:41, Eddie Epstein  wrote:
> > > > Hi,
> > > >
> > > > Removing the AE from the pipeline was a good idea to help isolate the
> > > > bottleneck. The other two most likely possibilities are the
> collection
> > > > reader pulling from elastic search or the CAS consumer writing the
> > > > processing output.
> > > >
> > > > DUCC Jobs are a simple way to scale out compute bottlenecks across a
> > > > cluster. Scaleout may be of limited or no value for I/O bound jobs.
> > > &g

Re: UIMA DUCC slow processing

2020-06-12 Thread Eddie Epstein
Hi,

In this simple scenario there is a CollectionReader running in a JobDriver
process, delivering 100K workitems to multiple remote JobProcesses. The
processing time is essentially zero.  (30 * 60 seconds) / 100,000 workitems
= 18 milliseconds per workitem. This time is roughly the expected overhead
of a DUCC jobDriver delivering workitems to remote JobProcesses and
recording the results. DUCC jobs are much more efficient if the overhead
per workitem is much smaller than the processing time.

Typically DUCC jobs would be processing much larger blocks of content per
workitem. For example, if a workitem was a document, and the document
parsed into the small CASes by the CasMultiplier, the throughput would be
much better. However, with this example, as the number of working
JobProcess threads is scaled up, the CR (JobDriver) would become a
bottleneck. That's why a typical DUCC Job will not send the Document
content as a workitem, but rather send a reference to the workitem content
and have the CasMultipliers in the JobProcesses read the content directly
from the source.
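A bare-bones sketch of that pattern (illustrative names, not the DUCC sample code): the
work item CAS carries only a reference, and the CAS multiplier running in the job process
fetches the content itself:

  import java.nio.charset.StandardCharsets;
  import java.nio.file.Files;
  import java.nio.file.Paths;
  import org.apache.uima.analysis_component.JCasMultiplier_ImplBase;
  import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
  import org.apache.uima.cas.AbstractCas;
  import org.apache.uima.jcas.JCas;

  public class ByReferenceCasMultiplier extends JCasMultiplier_ImplBase {
      private String pendingText; // content for the next CAS to emit

      @Override
      public void process(JCas workItem) throws AnalysisEngineProcessException {
          try {
              // The work item is assumed to hold only a file path; read the real
              // content here in the job process, not in the job driver.
              String path = workItem.getDocumentText().trim();
              pendingText = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
          } catch (Exception e) {
              throw new AnalysisEngineProcessException(e);
          }
      }

      @Override
      public boolean hasNext() throws AnalysisEngineProcessException {
          return pendingText != null;
      }

      @Override
      public AbstractCas next() throws AnalysisEngineProcessException {
          JCas cas = getEmptyJCas();
          cas.setDocumentText(pendingText);
          pendingText = null;
          return cas;
      }
  }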

Even though content read by the JobProcesses is much more efficient, as
scaleout continued to increase for this non-computation scenario the
bottleneck would eventually move to the underlying filesystem or whatever
document source and JobProcess output are. The main motivation for DUCC was
jobs similar to those in the DUCC examples which use OpenNLP to process
large documents. That is, jobs where CPU processing is the bottleneck
rather than I/O.

Hopefully this helps. If not, happy to continue the discussion.
Eddie

On Fri, Jun 12, 2020 at 1:16 PM Dr. Raja M. Suleman <
raja.m.sulai...@gmail.com> wrote:

> Hi,
> Thank you for your reply and I'm sorry I couldn't get back to this
> earlier.
>
> To get a better picture of the processing speed of DUCC, I made a dummy
> pipeline where the CollectionReader runs a for loop to generate 100k
> workitems (so no disk reads). Each workitem only has a simple string in it.
> These are then passed on to the CasMultiplier where for each workitem I'm
> creating a new CAS with DocumentInfo (again only having a simple string
> value) and pass it as a newcas to the CasConsumer. The CasConsumer doesn't
> do anything except add the Document received in the CAS to the logger. So
> basically this pipeline isn't doing anything, no Input reads and the only
> output is the information added to the logger. Running this on the cluster
> with 2 slave nodes with 8-CPUs and 32GB RAM each is still taking more than
> 30 minutes. I don't understand how this is possible since no heavy
> I/O processing is happening in the code.
>
> Any ideas please?
>
> Thank you.
>
> On 2020/05/18 12:47:41, Eddie Epstein  wrote:
> > Hi,
> >
> > Removing the AE from the pipeline was a good idea to help isolate the
> > bottleneck. The other two most likely possibilities are the collection
> > reader pulling from elastic search or the CAS consumer writing the
> > processing output.
> >
> > DUCC Jobs are a simple way to scale out compute bottlenecks across a
> > cluster. Scaleout may be of limited or no value for I/O bound jobs.
> > Please give a more complete picture of the processing scenario on DUCC.
> >
> > Regards,
> > Eddie
> >
> >
> > On Sat, May 16, 2020 at 1:29 AM Raja Muhammad Suleman <
> > sulem...@edgehill.ac.uk> wrote:
> >
> > > Hi,
> > > I've been trying to run a very small UIMA DUCC cluster with 2 slave
> nodes
> > > having 32GB of RAM each. I wrote a custom Collection Reader to read
> data
> > > from an Elasticsearch index and dump it into a new index after certain
> > > analysis engine processing. The Analysis Engine is a simple sentiment
> > > analysis code. The performance I'm getting is very slow as it is only
> able
> > > to process ~150 documents/minute.
> > > To test the performance without the analysis engine, I removed the AE
> from
> > > the pipeline but still I did not get any improvement in the processing
> > > speeds. Can you please guide me as to where I might be going wrong or
> what
> > > I can do to improve the processing speeds?
> > >
> > > Thank you.
> > > 
> > > Edge Hill University<http://ehu.ac.uk/home/emailfooter>
> > > Teaching Excellence Framework Gold Award<
> http://ehu.ac.uk/tef/emailfooter>
> > > 
> > > This message is private and confidential. If you have received this
> > > message in error, please notify the sender and remove it from your
> system.
> > > Any views or opinions presented are solely those of the author and do
> not
> > > necessarily represent those of Edge Hill or associated companies. Edge
> Hill
> > > University may monitor email traffic data and also the content of
> email for
> > > the purposes of security and business communications during staff
> absence.<
> > > http://ehu.ac.uk/itspolicies/emailfooter>
> > >
> >
>


Re: UIMA DUCC slow processing

2020-05-18 Thread Eddie Epstein
Hi,

Removing the AE from the pipeline was a good idea to help isolate the
bottleneck. The other two most likely possibilities are the collection
reader pulling from elastic search or the CAS consumer writing the
processing output.

DUCC Jobs are a simple way to scale out compute bottlenecks across a
cluster. Scaleout may be of limited or no value for I/O bound jobs.
Please give a more complete picture of the processing scenario on DUCC.

Regards,
Eddie


On Sat, May 16, 2020 at 1:29 AM Raja Muhammad Suleman <
sulem...@edgehill.ac.uk> wrote:

> Hi,
> I've been trying to run a very small UIMA DUCC cluster with 2 slave nodes
> having 32GB of RAM each. I wrote a custom Collection Reader to read data
> from an Elasticsearch index and dump it into a new index after certain
> analysis engine processing. The Analysis Engine is a simple sentiment
> analysis code. The performance I'm getting is very slow as it is only able
> to process ~150 documents/minute.
> To test the performance without the analysis engine, I removed the AE from
> the pipeline but still I did not get any improvement in the processing
> speeds. Can you please guide me as to where I might be going wrong or what
> I can do to improve the processing speeds?
>
> Thank you.
> 
> Edge Hill University
> Teaching Excellence Framework Gold Award
> 
> This message is private and confidential. If you have received this
> message in error, please notify the sender and remove it from your system.
> Any views or opinions presented are solely those of the author and do not
> necessarily represent those of Edge Hill or associated companies. Edge Hill
> University may monitor email traffic data and also the content of email for
> the purposes of security and business communications during staff absence.<
> http://ehu.ac.uk/itspolicies/emailfooter>
>


Re: Use of CASes with sofaURI?

2019-10-25 Thread Eddie Epstein
Besides very large documents and remote data, another major motivation was
for non-text data, such as audio or video.
Eddie
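A minimal sketch of the URI-backed sofa usage discussed below (the path and MIME type
are placeholders):

  import java.io.InputStream;
  import java.nio.charset.StandardCharsets;
  import org.apache.uima.cas.CAS;
  import org.apache.uima.resource.metadata.TypeSystemDescription;
  import org.apache.uima.util.CasCreationUtils;

  public class SofaUriExample {
      public static void main(String[] args) throws Exception {
          CAS cas = CasCreationUtils.createCas((TypeSystemDescription) null, null, null);
          // Point the sofa at external data instead of storing a string in the CAS.
          cas.setSofaDataURI("file:/path/to/my/file", "text/plain");

          // getSofaDataStream() resolves the URI; getDocumentText() remains null here.
          try (InputStream in = cas.getSofaDataStream()) {
              String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
              System.out.println("read " + text.length() + " chars via the sofa URI");
          }
      }
  }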

On Fri, Oct 25, 2019 at 1:33 PM Marshall Schor  wrote:

> Hi,
>
> Here's what I vaguely remember were the driving use cases for the sofa as a
> URI.
>
> 1.  The main use case was for applications where the data was so large, it
> would
> be unreasonable to read it all in and save as a string.
>
> 2.  The prohibition on changing a sofa spec (without resetting the CAS)
> was that
> it has the potential for users to invalidate the results, in this
> (imagined)
> scenario:
>
> a) User creates cas with some sofa data,
> b) User runs annotators, which create annotations that "point into"
> the sofa
> data
> c) User changes the sofa spec, to different data, but now all the
> annotations still are pointing into "offsets" in the original data.
>
> You can change the sofa data setting, but only after resetting the CAS.
>
> Did you have a use case for wanting to change the sofa data without
> resetting the CAS?
>
>
> It sounds like you have another interesting use case:
>
> a) want to convert the sofa data uri -> a string and have the normal
> getDocumentText etc. work, but
> b) have the serialization serialize the sofaURI, and not the data
> that's
> present there.
>
> This might be a nice convenience.
>
> I can see a couple of issues:
>   a) it might need to have a good strategy for handling very large data.
> E.g.,
> the convert method might need to include a max string size spec.
>   b) Since the serialization would serialize the annotations, but not the
> data
> (it would only serialize the URI), the data at that URI could easily
> change,
> making the annotation results meaningless.  Perhaps some "fingerprinting"
> (developing a checksum of the data, and serializing that to be able to
> signal if
> that did happen) would be a reasonable protection.
>
> Maybe do a new feature-request issue?
>
> -Marshall
>
> Imagine the JavaDoc for this method would be saying something like: has the
> potential to exceed your memory, at run time, due to the potential size of
> the
> data...
>
>
> On 10/25/2019 12:59 PM, Richard Eckart de Castilho wrote:
> > Hi,
> >
> > On 25. Oct 2019, at 17:53, Marshall Schor  wrote:
> >> One other useful source for examples:  The test cases for UIMA, e.g.
> search the
> >> uimaj-core projects *.java files for "getSofaDataStream".
> > Ok, let me elaborate :)
> >
> > One can use setSofaDataURI(url) to tell the CAS that the sofa data is
> actually external.
> > One can then use getSofaDataStream() resolve the URL and retrieve the
> data as a stream.
> >
> > So let's assume I have a CAS containing annotations on a text and the
> text is in an external file:
> >
> >   CAS cas = CasCreationUtils.createCas((TypeSystemDescription) null,
> null, null);
> >   cas.setSofaDataURI("file:/path/to/my/file", "text/plain");
> >
> > Works nice when I use getSofaDataStream() to retrieve the data.
> >
> > But I can't use the "normal" methods like getDocumentText() or
> getCoveredText() at all.
> >
> > Also, I cannot call setSofaDataString(urlContent, "text/plain") - it
> throws an exception
> > because there is already a sofaURI set. This is a major inconvenience.
> >
> > The ClearTK guys came up with an approach that tries to make this a bit
> more convenient:
> >
> > * they introduce a well-known view named "UriView" and set the
> sofaDataURI in that view.
> > * then they use a special reader which looks up the URI in that view,
> resolves it and
> >   drops the content into the sofaDataString of the "_defaultView".
> >
> > That way they get the benefit of the externally stored sofa as well as
> the ability to use
> > the usual methods to access the text.
> >
> > When I looked at setSofaDataURI(), I naively expected that it would be
> resolved the first
> > time I try to access the sofa data (e.g. via getDocumentText()) - but
> that doesn't happen.
> >
> > Then I expected that I would just call getSofaDataStream() and manually
> drop the contents
> > into setSofaDataString() and that this data string would be "transient",
> i.e. not saved
> > into XMI because we already have a setSofaDataURI set... but that
> expectation was also
> > not fulfilled.
> >
> > Could it be useful to introduce some place where we can transiently drop
> data obtained
> > from the sofaDataURI such that methods like getDocumentText() and
> getCoveredText() do
> > something useful but also such that the data is not included when
> serializing the CAS to
> > whatever format?
> >
> > Cheers,
> >
> > -- Richard
>


Re: DUCC without shared file system

2019-09-05 Thread Eddie Epstein
Unless all CLI/API submissions are done from the head node, DUCC still has
a dependency on a shared filesystem to authenticate such requests for
configurations where user processes run with user credentials.

On Wed, Sep 4, 2019 at 9:41 AM Lou DeGenaro  wrote:

> The DUCC Book for the Apache-UIMA DUCC demo
> http://uima-ducc-demo.apache.org:42133/doc/duccbook.html has been updated
> with respect to Jira 6121.  In particular, see section 12.9 for an example
> use of ducc_rsync to install DUCC on an additional worker node when not
> using a shared filesystem for $DUCC_HOME.
>
> On Tue, Sep 3, 2019 at 5:06 PM Lou DeGenaro 
> wrote:
>
> > I opened Jira https://issues.apache.org/jira/browse/UIMA-6121 to track
> > this issue.
> >
> >
> > On Tue, Sep 3, 2019 at 1:51 PM Lou DeGenaro 
> > wrote:
> >
> >> You do not need to do anything special to run DUCC without a shared
> >> filesystem.  Simply install it on a local filesystem.  However, there is
> >> one caveat.  If the user's (e.g. DUCC jobs) log directory is not in a
> >> shared filesystem, then DUCC-Mon will not have access and the contents
> >> won't be viewable. I'll open a Jira to review the DUCC Book and
> fix/clarify
> >> shared file system requirements.
> >>
> >> Lou.
> >>
> >> On Tue, Sep 3, 2019 at 11:58 AM Wahed Hemati <
> hem...@em.uni-frankfurt.de>
> >> wrote:
> >>
> >>> Hi there,
> >>>
> >>> the release notes of DUCC 3.0.0 indicates, that one major change is,
> >>> that DUCC can now run without shared file system.
> >>>
> >>> How do I set this up? In the Duccbook however it says that you need a
> >>> shared filesystem to add more nodes
> >>> (https://uima.apache.org/d/uima-ducc-3.0.0/duccbook.html#x1-22400012.9
> ).
> >>>
> >>> Thanks in advance.
> >>>
> >>> -Wahed
> >>>
> >>>
>


Re: Customizing Sample Pinger of Uima

2019-04-30 Thread Eddie Epstein
Hi Florian,

Interesting questions. First, yes the intended behavior is to leave 1
instance running. Services are either started by having autostart=true, or
by a job or another service having a dependency on the service. Logically
it could be possible to let a pinger stop all instances and have the
service still be in some kind of "running" state so that the pinger would
continue running and be able to restart instances when it detected a need;
all that is needed is a bit of programming :)

A hacky approach would be not to use autostart, rather to start service-A
by using a dummy service-B with a dependency on A. When service A pinger
wants to stop A, it could issue a command to stop B which would allow
service A to be stopped. Restarting A would require an external program
requesting B to be started again.

For the second question, the answer is yes for UIMA-AS services. The latest
version of UIMA-AS supports sending process requests to specific service
instances. A pinger could send such requests, and when an instance fails to
reply the pinger can direct that instance to be stopped and another
instance started. The answer is also yes for custom services for which the
pinger knows how to address each instance.

Regards,
Eddie

On Tue, Apr 30, 2019 at 1:43 PM Florian  wrote:

> Hello everyone,
>
> I have two questions about the given sample pinger example of Uima.
>
> Is it possible to set the minimal number of instances of a service to
> zero? If I set the min-variable to zero, UIMA always starts a new
> instance when the last one is shut down. Is this behavior intended, or
> is there a way to prevent the start of a new instance when there are no
> calls to the service? As we have some services that are rarely used, we
> would only like to start instances on demand.
>
> Secondly, is there also an option to call specific instances of a service
> and restart them? We would like to do health checks for individual
> instances and restart them if needed.
>
> Best Regards
>
> Florian
>
>
>
>


Re: DUCC Job does not work on any other language except English

2018-08-04 Thread Eddie Epstein
Hi Rohit,

Hopefully this is something fairly easy to fix. Thanks for the information.

Eddie

On Thu, Aug 2, 2018 at 2:46 AM, Rohit Yadav  wrote:

> Hi,
>
> I've tried running a DUCC job for various languages but all the content is
> replaced by '?' (question mark).
>
> But for English it works fine. I was wondering whether this is a problem in
> the configuration of DUCC.
>
> Any idea about this?
>
> Best,
>
> Rohit
>
>


Re: Restrict resource of a DUCC node

2018-07-25 Thread Eddie Epstein
Hi Erik,

Your user ID has hit the limit for "max user processes" on the machine.
Note that processes and threads are the same in Linux, and a single JVM may
spawn many threads (for example many GC threads :). This parameter used to
be unlimited for users, but there was a change in Red Hat distros to limit
users to 1024 or so a few years ago. The system admin will have to raise
the limit for users. On redhat the configuration needs to be in
/etc/security/limits.d/90-nproc.conf for RHEL v7.x
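A hedged sketch of such an override (the file name and numbers are illustrative and vary
by distro and site policy):

  # /etc/security/limits.d/90-nproc.conf  -- illustrative values only
  *      soft    nproc    16384
  ducc   soft    nproc    65536
  ducc   hard    nproc    65536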

Eddie


On Wed, Jul 25, 2018 at 9:23 AM, Erik Fäßler 
wrote:

> Hi all,
>
> is there a way to tell DUCC how many resources of a node it might allocate
> to jobs? My issue is that when I let DUCC scale out my jobs with an
> appropriate memory definition via process_memory_size, I get a lot of “Init
> Fails” where each failed process log shows
>
> #
> # There is insufficient memory for the Java Runtime Environment to
> continue.
> # Cannot create GC thread. Out of system resources.
>
>
>
>
> If I raise the memory requirement per job to something like 40GB (which they
> never require), this issue does not come up because only a few processes get
> started, but then most CPU cores go unused, wasting time.
>
> I can’t use the machines exclusively for DUCC, so can I tell DUCC somehow
> how many resources (CPUs, memory) it may allocate?
>
> Thanks,
>
> Erik


Re: DUCC services statistics

2018-07-19 Thread Eddie Epstein
Hi,

As you may see, the default DUCC pinger for UIMA-AS services scrapes JMX
stats from the service broker to report the number of producers and
consumers, the queue depth and a few other things. This pinger also does
stat reset operations on each call to the pinger, I think every 2 minutes.
A custom pinger can easily be created that does not do the reset, or even
the getMeta call if desired. The string that pingers return to DUCC is
displayed as hover text over the entries in the "State" column, for example
over "Available".

Eddie



On Thu, Jul 19, 2018 at 7:31 AM, Daniel Baumartz <
bauma...@stud.uni-frankfurt.de> wrote:

> Hi,
>
> we have a DUCC installation with different services that are being used by
> jobs and external clients. We want to collect some statistics for a
> monitoring tool that are not included in the DUCC dashboard, e.g. how often
> and when a service/queue has been used, which services are used by which
> client at the moment...
>
> It looks like I could get this information by monitoring the ActiveMQ
> messages with an external program, or by using a custom Pinger? What would
> be the best way to handle this?
>
> Thanks,
> Daniel
>


Re: High CPU Load on Job Driver

2018-07-19 Thread Eddie Epstein
Hi Rohit,

What is the collection reader running in the job driver doing? Look at the
memory use (RSS) value for the job driver on the job details page. If
nothing is logged (be sure to check ducc.log file) my guess would be that
the JD ran out of RAM in its cgroup and was killed. The JD cgroup size can
be increased dynamically for future jobs without DUCC restart by editing
ducc.properties (ducc.jd.share.quantum). The default Xmx for JD is specified
by ducc.driver.jvm.args.
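For illustration only (the numbers are placeholders; see the DuccBook for the units and
defaults of these properties):

  # ducc.properties
  ducc.jd.share.quantum = 800
  ducc.driver.jvm.args  = -Xmx600M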

Eddie

On Wed, Jul 18, 2018 at 9:12 AM, Rohit yadav  wrote:

> Hi,
>
> I am running a job on DUCC with a CR, AE and CC. But while running, the Job
> Driver's CPU consumption increases to 600%.
> I wanted to know: is it normal to have high CPU consumption in the Job Driver?
> Also, after running for 1-2 hours, DUCC stops.
> And after I restart DUCC and check the logs of the JD, there is nothing
> mentioning why DUCC stopped.
> Also, my DUCC is running on 3 nodes.
> The JD is configured at the head node.
> --
> Best,
> *Rohit Yadav*
>


Re: run existing AE instance on different view

2018-07-10 Thread Eddie Epstein
I think the UIMA code uses the annotator context to map the _InitialView
and the context remains static for the life of the annotator. Replicating
annotators to handle different views has been used here too, but I agree it
is ugly.

If the annotator code can be changed, then one approach would be to put
some information in a fixed _InitialView that specifies which named view(s)
should be analyzed and have all downstream annotators use that to select
the view(s) to operate on.

It also sounds possible to have a single new component use the CasCopier to
create a new view that is always the one processed.
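A rough sketch of that last idea, assuming the CasCopier overload that copies one view
onto another (view names are illustrative; check the exact signature in the CasCopier
javadoc for your UIMA version):

  import org.apache.uima.cas.CAS;
  import org.apache.uima.util.CasCopier;

  public class ViewRemapExample {
      // Copy an arbitrary source view into a fixed "work" view that downstream
      // annotators always process.
      static void copyToWorkView(CAS cas, String sourceViewName) {
          CAS source = cas.getView(sourceViewName);
          CAS work = cas.createView("workView");
          CasCopier copier = new CasCopier(cas, cas);
          copier.copyCasView(source, work, true); // true = also copy the sofa data
      }
  }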

Regards,
Eddie

On Thu, Jul 5, 2018 at 8:52 AM, Jens Grivolla  wrote:

> Hi,
>
> I'm trying to run an already instantiated AE on a view other than
> _InitialView. Unfortunately, I can't just call process() on the desired
> view, as there is a call to Util.getStartingView(...)
> in PrimitiveAnalysisEngine_impl that forces it back to _InitialView.
>
> The view mapping methods I found (e.g. using an AggregateBuilder) work on
> AE descriptions, so I would need to create additional instances (with the
> corresponding memory overhead). Is there a way to remap/rename the views in
> a JCas before calling process() so that the desired view is seen as the
> _InitialView? It looks like CasCopier.copyCasView(..) could maybe be used
> for this, but it doesn't feel quite right.
>
> Best,
> Jens
>


Re: Problem in running DUCC Job for Arabic Language

2018-07-05 Thread Eddie Epstein
So if you run the AE as a DUCC UIMA-AS service and send it CASes from some
UIMA-AS client it works OK? The full environment for all processes that
DUCC launches is available via ducc-mon under the Specification or
Registry tab for that job or managed reservation or service. Please see if
the LANG setting for the service is different from the LANG setting for the
job.

One can also see the LANG setting for a Linux process id by doing:

cat /proc/<pid>/environ

The LANG to be used for a DUCC process can be set by adding to the
--environment argument "LANG=xxx" as needed
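For example (the pid and the locale value are placeholders):

  # inspect the environment of a running DUCC process
  cat /proc/<pid>/environ | tr '\0' '\n' | grep LANG

  # job specification / --environment option: force a UTF-8 locale
  environment = LANG=en_US.UTF-8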

Thanks,
Eddie



On Thu, Jul 5, 2018 at 6:47 AM, rohit14csu...@ncuindia.edu <
rohit14csu...@ncuindia.edu> wrote:

> Hey,
> Yeah, you got it right: the first snippet is in the CR before the data goes
> into the CAS.
> And the second snippet is in the first annotator or analysis engine(AE) of
> my Aggregate Descriptor.
> I am pretty sure this is an issue of the CAS used by DUCC because if I use a
> service of DUCC, in which we are supposed to send the CAS and receive the
> same CAS with added features from DUCC, I get accurate results.
>
> But the only problem comes in submitting a job where the CAS is generated
> by DUCC.
> This can also be an issue of the environment (language) of DUCC because the
> default language is English.
>
> Best Regards
> Rohit
>
> On 2018/07/03 13:11:50, Eddie Epstein  wrote:
> > Rohit,
> >
> > Before sending the data into jcas if i force encode it :-
> > >
> > > String content2 = null;
> > > content2 = new String(content.getBytes("UTF-8"), "ISO-8859-1");
> > > jcas.setDocumentText(content2);
> > >
> >
> > Where is this code, in the job CR?
> >
> >
> >
> > >
> > > And when i go in my first annotator i force decode it:-
> > >
> > > String content = null;
> > > content = new String(jcas.getDocumentText.getBytes("ISO-8859-1"),
> > > "UTF-8");
> > >
> >
> > And is this in the first annotator of the job process, i.e. the CM?
> >
> > Please be as specific as possible.
> >
> > Thanks,
> > Eddie
> >
>


Re: Problem in running DUCC Job for Arabic Language

2018-07-03 Thread Eddie Epstein
Rohit,

Before sending the data into jcas if i force encode it :-
>
> String content2 = null;
> content2 = new String(content.getBytes("UTF-8"), "ISO-8859-1");
> jcas.setDocumentText(content2);
>

Where is this code, in the job CR?



>
> And when i go in my first annotator i force decode it:-
>
> String content = null;
> content = new String(jcas.getDocumentText.getBytes("ISO-8859-1"),
> "UTF-8");
>

And is this in the first annotator of the job process, i.e. the CM?

Please be as specific as possible.

Thanks,
Eddie


Re: Problem in running DUCC Job for Arabic Language

2018-06-18 Thread Eddie Epstein
Hi Rohit,

In a DUCC job the CAS created by the user's CR in the Job Driver is serialized
into cas.xmi format, transported to the Job Process where it is
deserialized and given to the user's analytics. Likely the problem is in CAS
serialization or deserialization, perhaps due to the active LANG
environment on the JD or JP machines?

Eddie

On Thu, Jun 14, 2018 at 1:48 AM, Rohit yadav  wrote:

> Hey,
>
> I use DUCC for english language and it works without any problem.
> But lately I tried deploying a job for the Arabic language, and all the
> content of the Arabic text is replaced by *'?'* (question mark).
>
> I am extracting data from Accumulo and after processing I send it to ES6.
>
> When I checked the log files of the JD, it shows that the Arabic data is
> coming into the CR without any problem.
> But when I check another log file, it shows that the moment the data enters
> my AE the Arabic content is replaced by question marks.
> Please find the log files attached with this mail.
>
> I think this may be a problem with the CM, because the data is fine inside
> the CR, and the most interesting part is that if I try running the same
> pipeline through the CPM it works without any problem, which means DUCC is
> facing some issue.
>
> I'll look forward to your reply.
>
> --
> Best Regards,
> *Rohit Yadav*
>


Re: [External Sender] Re: Runtime Parameters to Annotators Running as Services

2018-06-04 Thread Eddie Epstein
From the original description I understand the scenario to be that the
service needs to access a database that is unknown at service
initialization time. Then the CAS received by the service must include a
handle to the database. The CAS would be generated by the client, which in
your case sounds like it includes a collection reader. If the client is a
UIMA aggregate and the remote service is one of the delegates, then any
annotator between the CR and the remote delegate could add content to the
CAS.

Sorry if I am missing something here.
Eddie
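As a concrete sketch of putting such a handle into the CAS (the type name and feature are
hypothetical and would have to be declared in the type system shared by client and
service):

  import org.apache.uima.cas.CAS;
  import org.apache.uima.cas.Feature;
  import org.apache.uima.cas.FeatureStructure;
  import org.apache.uima.cas.Type;

  public class AnalysisRequestMetadata {
      // Adds a feature structure of the (hypothetical) type org.myorg.AnalysisRequest
      // carrying the database identifier the service should use for this CAS.
      public static void addRequestInfo(CAS cas, String databaseId) {
          Type t = cas.getTypeSystem().getType("org.myorg.AnalysisRequest");
          Feature dbId = t.getFeatureByBaseName("databaseId");
          FeatureStructure fs = cas.createFS(t);
          fs.setStringValue(dbId, databaseId);
          cas.addFsToIndexes(fs);
      }
  }

Either the collection reader or any annotator upstream of the remote delegate could call
something like this before the CAS is sent.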

On Fri, Jun 1, 2018 at 9:39 AM, Osborne, John D  wrote:

> Thanks - when you say having the client putting the data in the CAS do you
> mean:
>
> 1) Putting in the CollectionReader which the client is instantiating
> 2) Some other mechanism of putting data into the CAS I am not aware of
>
> I had been using 1), but in the process of refactoring my
> CollectionReader I was trying to slim it down and just have it pass
> document identifiers to the aggregate analysis engine. I'm fuzzy on whether
> 2) is an option and if so how to implement.
>
>  -John
>
>
> ________
> From: Eddie Epstein [eaepst...@gmail.com]
> Sent: Thursday, May 31, 2018 4:25 PM
> To: user@uima.apache.org
> Subject: [External Sender] Re: Runtime Parameters to Annotators Running as
> Services
>
> I may not understand the scenario.
>
> For meta-data that would modify the behavior of the analysis, for example
> changing what analysis is run for a  CAS, putting it into the CAS itself is
> definitely recommended.
>
> The example above is for the UIMA service to access the artifact itself
> from a remote source (presumably because it is even less efficient for the
> remote client to put the data into the CAS). That is certainly recommended
> for high scale out of analysis services, assuming that the remote source
> can handle the load and not become a worse bottleneck than just having the
> client put the data into the CAS.
>
> Regards,
> Eddie
>
> On Tue, May 29, 2018 at 1:33 PM, Osborne, John D 
> wrote:
>
> > What is the best practice for passing runtime meta-data about the
> analysis
> > to individual annotators when running UIMA-AS or UIMA-DUCC services? An
> > example would be  a database identifier for an analysis of many
> documents.
> > I can't pass this in as parameters to the aggregate analysis engine
> running
> > as a service, because I don't know what that identifier is until runtime
> > (when the application calls the service).
> >
> > I used to put such information in the JCas, having the CollectionReader
> > implementation do all this work. But I am striving to have a more
> > lightweight CollectionReader... The application can obviously write
> > metadata to a database or other shared resource, but then it becomes
> > incumbent on the AnalysisEngine to access that shared resource over the
> > network (slow).
> >
> > Any advice appreciated,
> >
> >  -John
> >
>


Re: Runtime Parameters to Annotators Running as Services

2018-05-31 Thread Eddie Epstein
I may not understand the scenario.

For meta-data that would modify the behavior of the analysis, for example
changing what analysis is run for a  CAS, putting it into the CAS itself is
definitely recommended.

The example above is for the UIMA service to access the artifact itself
from a remote source (presumably because it is even less efficient for the
remote client to put the data into the CAS). That is certainly recommended
for high scale out of analysis services, assuming that the remote source
can handle the load and not become a worse bottleneck than just having the
client put the data into the CAS.

Regards,
Eddie

On Tue, May 29, 2018 at 1:33 PM, Osborne, John D  wrote:

> What is the best practice for passing runtime meta-data about the analysis
> to individual annotators when running UIMA-AS or UIMA-DUCC services? An
> example would be  a database identifier for an analysis of many documents.
> I can't pass this in as parameters to the aggregate analysis engine running
> as a service, because I don't know what that identifier is until runtime
> (when the application calls the service).
>
> I used to put such information in the JCas, having the CollectionReader
> implementation do all this work. But I am striving to have a more
> lightweight CollectionReader... The application can obviously write
> metadata to a database or other shared resource, but then it becomes
> incumbent on the AnalysisEngine to access that shared resource over the
> network (slow).
>
> Any advice appreciated,
>
>  -John
>


Re: Batch Checkpoints with DUCC?

2018-05-16 Thread Eddie Epstein
Hi,

Yes, exactly. DUCC jobs that specify CM,AE, CC use a custom flow controller
that routes the WorkItem CAS as desired. By default the route is (CM,CC),
but this can be modified by the contents of the WorkItem feature structure
... http://uima.apache.org/d/uima-ducc-2.2.2/duccbook.html#x1-1930009.5.3

Eddie


On Wed, May 16, 2018 at 2:56 AM, Erik Fäßler <erik.faess...@uni-jena.de>
wrote:

> Hey Eddie, thanks again! :-)
>
> So the idea is that the work item is the CAS that the CR sent to the CM,
> right? The work item CAS consists of a list of artifacts which are output
> by the CM, processed by the pipeline and finally cached by the CC.
> Then, I can somehow (have to read this up) have the work item CAS sent to
> the CC as the effective “batch processing complete” signal.
>
> Is that correct?
>
> > On 15. May 2018, at 20:50, Eddie Epstein <eaepst...@gmail.com> wrote:
> >
> > Hi Erik,
> >
> > There is a brief discussion of this in the duccbook in section 9.3 ...
> > https://uima.apache.org/d/uima-ducc-2.2.2/duccbook.html#x1-1880009.3
> >
> > In particular, the 3rd option, "Flushing cached data". This assumes that
> > the batch of work to be flushed is represented by each workitem CAS.
> >
> > Regards,
> > Eddie
> >
> > On Tue, May 15, 2018 at 9:21 AM, Erik Fäßler <erik.faess...@uni-jena.de>
> > wrote:
> >
> >> And another question concerning DUCC :-)
> >>
> >> With my CPEs I use a lot the batchProcessingComplete() and
> >> collectionProcessingComplete() methods. I need them because I do a lot
> of
> >> database interactions where I need to send data in batches due to the
> >> overhead of network communication.
> >> How is that handled in DUCC? The documentation does not talk about it,
> at
> >> least I did not find anything.
> >>
> >> Hints are appreciated.
> >>
> >> Thanks!
> >>
> >> Erik
>
>


Re: Batch Checkpoints with DUCC?

2018-05-15 Thread Eddie Epstein
Hi Erik,

There is a brief discussion of this in the duccbook in section 9.3 ...
https://uima.apache.org/d/uima-ducc-2.2.2/duccbook.html#x1-1880009.3

In particular, the 3rd option, "Flushing cached data". This assumes that
the batch of work to be flushed is represented by each workitem CAS.

Regards,
Eddie

On Tue, May 15, 2018 at 9:21 AM, Erik Fäßler 
wrote:

> And another question concerning DUCC :-)
>
> With my CPEs I use a lot the batchProcessingComplete() and
> collectionProcessingComplete() methods. I need them because I do a lot of
> database interactions where I need to send data in batches due to the
> overhead of network communication.
> How is that handled in DUCC? The documentation does not talk about it, at
> least I did not find anything.
>
> Hints are appreciated.
>
> Thanks!
>
> Erik


Re: DUCC job Issue

2018-04-20 Thread Eddie Epstein
DUCC is designed for multi-user environments, and in particular tries to
balance resources fairly quickly across users in a fair-share allocation.
The default mechanism used is preemption. To eliminate preemption specify a
"non-preemptable" scheduling class for jobs such as "fixed".

Other options that could be of interest include:

ducc.rm.initialization.cap
This limits allocation to jobs until initialization is successful, limiting
the "damage" to other running preemptable jobs if a new job will not even
initialize.

ducc.rm.expand.by.doubling
This limits the rate at which resources are allocated to allow gaining some
knowledge about job throughput to avoid over allocating resources.

ducc.rm.prediction
Used along with doubling to avoid unnecessary allocation.
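Illustrative settings only (the first line goes in the job specification, the rest in
ducc.properties; the values are examples, not recommendations):

  # job specification: use a non-preemptable scheduling class
  scheduling_class = fixed

  # ducc.properties
  ducc.rm.initialization.cap = 1
  ducc.rm.expand.by.doubling = true
  ducc.rm.prediction         = true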

Regards,
Eddie


On Fri, Apr 20, 2018 at 5:31 AM, priyank sharma 
wrote:

> Hey!
>
> I am facing trouble while running one job at a time in DUCC. I want that
> if one job is running, the other one should wait for it to complete and then
> start.
> My configuration file is attached below.
> Please help me with what I am missing.
>
> --
> Thanks and Regards
> Priyank Sharma
>
>


Re: DUCC and CAS Consumers

2018-04-11 Thread Eddie Epstein
Hi Erik,

DUCC jobs can scale out a user's components in two ways: horizontally, by
running multiple processes (process_deployments_max), and vertically, by
running the pipeline defined by the CM, AE and CC components in multiple
threads (process_pipeline_count). Since the constructed top AAE is
designed to run in multiple threads, it requires multiple deployments to be
enabled for all pipeline components.

The CM and CC components are optional, since they may already be included in
the specified process_descriptor_AE. The reason for explicitly specifying
CM and CC components is to facilitate high scale out. The job's collection
reader should create CASes with references to data, which will often be
segmented by the CM into a collection of CASes to be processed by the user's
AE. The initial CAS created by the driver normally does not flow into the
AE, but typically does flow to the CC after all child CASes from the CM
have been processed, to trigger the CC to finalize the collection.

More information about the job model is described in the duccbook at
https://uima.apache.org/d/uima-ducc-2.2.2/duccbook.html#x1-181000III
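
As a sketch, a job specification wired this way looks roughly like the
following (key names as used in the duccbook; descriptor paths and counts are
placeholders only):

   driver_descriptor_CR    = MyWorkItemReader.xml
   process_descriptor_CM   = MyCasMultiplier.xml
   process_descriptor_AE   = MyAnalysisAAE.xml
   process_descriptor_CC   = MyZipWriter.xml
   process_pipeline_count  = 4     # threads per process
   process_deployments_max = 10    # processes across the cluster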

Regards,
Eddie


On Wed, Apr 11, 2018 at 5:16 AM, Erik Fäßler 
wrote:

> Hi all,
>
> I am doing my first steps with UIMA DUCC. I stumbled across the issue that
> my CAS consumer has allowMultipleDeployments=false since it is supposed to
> write multiple CAS document texts into one large ZIP file.
> DUCC complains about the discrepancy of the processing AAE being allowed
> for multiple deployment but one of its containers (my consumer) is not.
> I did specify the consumer with the "process_descriptor_CC” job file key
> and was assuming that DUCC would take care of it. After all, it is a key of
> its own. But it seems the consumer is just wrapped into a new AAE together
> with my annotator AAE. This new top AAE created by DUCC causes the error:
> My own AAE is allowed for multiple deployment and so are its delegates. But
> the consumer is not, of course.
>
> How to handle this case? The documentation of DUCC is rather vague at this
> point. There is the section about CAS consumer changes but it doesn’t
> mention multiple deployment explicitly.
>
> What is the “process_descriptor_CC” for when it get wrapped up into an AAE
> with the user-delivered AAE anyway?
>
> Thanks and best regards,
>
> Erik
>
>


Re: Exception: UIMA - Annotator Processing Failed

2018-02-28 Thread Eddie Epstein
Hi,

An annotation feature structure can only be added to the index of the view
it was created in.

It looks like the application at
edu.cmu.lti.oaqa.baseqa.evidence.concept.PassageConceptRecognizer.process(
PassageConceptRecognizer.java:96)
is trying to add an annotation created in one view to the index of a
different view.
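
In other words, something like this minimal sketch (not the project's actual
code; the view names are assumed to exist already):

import org.apache.uima.cas.CASException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

public class ViewIndexExample {
  static void annotate(JCas jcas) throws CASException {
    JCas viewA = jcas.getView("A");
    JCas viewB = jcas.getView("B");

    // Wrong: the annotation was created in viewA, so adding it to viewB's
    // index triggers the CASRuntimeException seen in the stack trace below.
    Annotation a = new Annotation(viewA, 0, 5);
    // viewB.addFsToIndexes(a);   // would fail

    // Right: create the annotation in the view whose index it is added to.
    Annotation b = new Annotation(viewB, 0, 5);
    viewB.addFsToIndexes(b);
  }
}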

Regards,
Eddie


On Wed, Feb 28, 2018 at 1:33 AM, Fatima Zulifqar <
fatimazulifqar...@gmail.com> wrote:

> Dear,
>
> I am facing the following issue while running an open source project which
> is based upon uima framework. I didn't find any solution concerned yet.
>
> *Feb 27, 2018 11:57:39 AM
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl
> callAnalysisComponentProcess(417)*
> *SEVERE: Exception occurred*
> *org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator
> processing failed.*
> * at
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.
> callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:401)*
> * at
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.
> processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:308)*
> * at
> org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(
> AnalysisEngineImplBase.java:269)*
> * at
> org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(
> AnalysisEngineImplBase.java:284)*
> * at edu.cmu.lti.oaqa.ecd.phase.BasePhase$1.run(BasePhase.java:226)*
> * at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)*
> * at java.util.concurrent.FutureTask.run(FutureTask.java:266)*
> * at
> java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1149)*
> * at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:624)*
> * at java.lang.Thread.run(Thread.java:748)*
> *Caused by: org.apache.uima.cas.CASRuntimeException: Error - the
> Annotation
> "#1 ConceptMention
> "ptv.http://www.ncbi.nlm.nih.gov/pubmed/21631649/abstract/
> 756/abstract/1073
> "*
> *   sofa:
> ptv.http://www.ncbi.nlm.nih.gov/pubmed/21631649/abstract/756/abstract/1073
> *
> *   begin: 223*
> *   end: 232*
> *   concept: #0 Concept*
> *  names: NonEmptyStringList*
> * head: "anti gq1b"*
> * tail: EmptyStringList*
> *  uris: EmptyStringList*
> *  ids: EmptyStringList*
> *  mentions: NonEmptyFSList*
> * head: ConceptMention*
> *sofa:
> ptv.http://www.ncbi.nlm.nih.gov/pubmed/22698187/abstract/
> 1131/abstract/1299
> *
> *begin: 32*
> *end: 41*
> *concept: #0*
> *matchedName: "anti-GQ1b"*
> *score: NaN*
> * tail: NonEmptyFSList*
> *head: #1*
> *tail: NonEmptyFSList*
> *   head: ConceptMention*
> *  sofa:
> ptv.http://www.ncbi.nlm.nih.gov/pubmed/23927937/abstract/303/abstract/503
> *
> *  begin: 179*
> *  end: 188*
> *  concept: #0*
> *  matchedName: "anti-GQ1b"*
> *  score: NaN*
> *   tail: NonEmptyFSList*
> *  head: ConceptMention*
> * sofa:
> ptv.http://www.ncbi.nlm.nih.gov/pubmed/19664367/abstract/0/abstract/330
> *
> * begin: 40*
> * end: 49*
> * concept: #0*
> * matchedName: "anti-GQ1b"*
> * score: NaN*
> *  tail: NonEmptyFSList*
> * head: ConceptMention*
> *sofa:
> ptv.http://www.ncbi.nlm.nih.gov/pubmed/25379047/abstract/140/abstract/406
> *
> *begin: 133*
> *end: 142*
> *concept: #0*
> *matchedName: "anti-GQ1b"*
> *score: NaN*
> * tail: NonEmptyFSList*
> *head: ConceptMention*
> *   sofa:
> ptv.http://www.ncbi.nlm.nih.gov/pubmed/22698187/abstract/189/abstract/386
> *
> *   begin: 3*
> *   end: 12*
> *   concept: #0*
> *   matchedName: "anti-GQ1b"*
> *   score: NaN*
> *tail: NonEmptyFSList*
> *   head: ConceptMention*
> *  sofa:
> 

Re: Completion event for replicated components

2018-01-18 Thread Eddie Epstein
There will be a new mechanism to help do this in the upcoming
uima-as-2.10.2 version. This version includes an additional listener on
every service that can be addressed individually. A uima-as client could
then iterate thru all service instances calling CPC, assuming the client
knew about all existing instances.

This does not solve the problem for replicated components in the same
service instance. For that the thread receiving the CPC would have to use
generic methods to trigger CPC processing in all the other threads.
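
For example, one generic way (just a sketch, nothing UIMA-AS specific): each
replicated instance registers itself in a shared registry, and whichever
instance receives collectionProcessComplete() forwards the flush to all of
its siblings.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;

public class ReplicatedWriter extends JCasAnnotator_ImplBase {
  // one registry shared by all replicated instances in this JVM
  private static final Set<ReplicatedWriter> INSTANCES = ConcurrentHashMap.newKeySet();

  public ReplicatedWriter() {
    INSTANCES.add(this);
  }

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    // ... buffer output for this instance ...
  }

  @Override
  public void collectionProcessComplete() throws AnalysisEngineProcessException {
    // only one instance receives this call; forward it to every sibling
    for (ReplicatedWriter w : INSTANCES) {
      w.flush();
    }
  }

  private synchronized void flush() {
    // write out whatever this instance has buffered
  }
}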

Eddie

On Thu, Jan 18, 2018 at 7:54 AM, n7...@t-online.de 
wrote:

> Hi,
>
> in chapter 1.5.2 in UIMA AS documentation
> https://uima.apache.org/d/uima-as-2.9.0/uima_async_
> scaleout.html#ugr.async.ov.concepts.collection_process_complete
> its stated that only one instance will receive the
> collectionProcessComplete call, if components are replicated.
>
> What is the best way to get collectionProcessComplete() or something else
> called in the replicated consumer components, when the collection is
> finished, in order to complete writing any output?
>
> Thanks and best regards,
> John
>
> 
>
> 
> Gesendet mit Telekom Mail  -
> kostenlos
> und sicher für alle!


Re: Ducc Service Registration Error

2017-11-20 Thread Eddie Epstein
Hi,

Annotator class "org.orkash.annotator.AnalysisEngine.TreebankChunkerAnnotator"
was not found ... means that this class is not in the classpath specified
by the registration.
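
Assuming a standard registration, something along these lines (the lib path is
only an example) should put the annotator's classes on the classpath:

   # service registration fragment
   classpath = /home/ducc/Uima_Sentiment_NLP/lib/*:/home/ducc/Uima_Sentiment_NLP/classes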

Eddie

On Mon, Nov 20, 2017 at 9:17 AM, priyank sharma 
wrote:

> Hi!
>
> When i am registering the service on the ducc it is not able to start and
> giving the error
>
> WARNING:
> org.apache.uima.resource.ResourceInitializationException: Annotator class
> "org.orkash.annotator.AnalysisEngine.TreebankChunkerAnnotator" was not
> found. (Descriptor: file:/home/ducc/Uima_Sentiment
> _NLP/desc/orkash/ae/TreeBankChunkerDescriptor.xml)
> at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine
> _impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:220)
> at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine
> _impl.initialize(PrimitiveAnalysisEngine_impl.java:170)
> at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResou
> rce(AnalysisEngineFactory_impl.java:94)
> at org.apache.uima.impl.CompositeResourceFactory_impl.produceRe
> source(CompositeResourceFactory_impl.java:62)
> at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.
> java:279)
> at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFram
> ework.java:407)
> at org.apache.uima.analysis_engine.asb.impl.ASB_impl.setup(ASB_
> impl.java:256)
> at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine
> _impl.initASB(AggregateAnalysisEngine_impl.java:429)
> at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine
> _impl.initializeAggregateAnalysisEngine(AggregateAnalysisEngine_impl.
> java:373)
> at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine
> _impl.initialize(AggregateAnalysisEngine_impl.java:186)
> at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResou
> rce(AnalysisEngineFactory_impl.java:94)
> at org.apache.uima.impl.CompositeResourceFactory_impl.produceRe
> source(CompositeResourceFactory_impl.java:62)
> at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.
> java:279)
> at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFram
> ework.java:407)
> at org.apache.uima.aae.controller.PrimitiveAnalysisEngineContro
> ller_impl.initializeAnalysisEngine(PrimitiveAnalysisEngineCo
> ntroller_impl.java:265)
> at org.apache.uima.aae.UimaAsThreadFactory$1.run(UimaAsThreadFa
> ctory.java:120)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException:
> org.orkash.annotator.AnalysisEngine.TreebankChunkerAnnotator
> at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:260)
> at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine
> _impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:217)
> ... 16 more
>
> Nov 20, 2017 7:15:22 PM 
> org.apache.uima.adapter.jms.activemq.SpringContainerDeployer
> notifyOnInitializationFailure
> WARNING: Top Level Controller Initialization Exception.
> org.apache.uima.resource.ResourceInitializationException: Annotator class
> "org.orkash.annotator.AnalysisEngine.TreebankChunkerAnnotator" was not
> found. (Descriptor: file:/home/ducc/Uima_Sentiment
> _NLP/desc/orkash/ae/TreeBankChunkerDescriptor.xml)
> at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine
> _impl.initializeAnalysisComponent(PrimitiveAnalysisEngine_impl.java:220)
> at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine
> _impl.initialize(PrimitiveAnalysisEngine_impl.java:170)
> at org.apache.uima.impl.AnalysisEngineFactory_impl.produceResou
> rce(AnalysisEngineFactory_impl.java:94)
> at org.apache.uima.impl.CompositeResourceFactory_impl.produceRe
> source(CompositeResourceFactory_impl.java:62)
> at org.apache.uima.UIMAFramework.produceResource(UIMAFramework.
> java:279)
> at org.apache.uima.UIMAFramework.produceAnalysisEngine(UIMAFram
> ework.java:407)
> at org.apache.uima.analysis_engine.asb.impl.ASB_impl.setup(ASB_
> impl.java:256)
> at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine
> _impl.initASB(AggregateAnalysisEngine_impl.java:429)
> at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine
> _impl.initializeAggregateAnalysisEngine(AggregateAnalysisEngine_impl.
> java:373)
> at 

Re: DUCC's job goes into infintie loop

2017-11-10 Thread Eddie Epstein
Hi Priyank,

Looks like you are running DUCC v2.0.x. Many bugs have been fixed in
subsequent versions, the latest being v2.2.1. Newer versions have a
ducc_update command that will upgrade an existing install, but given all
the changes since v2.0.x I suggest a clean install.

Eddie

On Fri, Nov 10, 2017 at 12:11 AM, priyank sharma 
wrote:

> There is nothing on the work item page and performance page on the web
> server. There is only one log file for the main node, and no log files for the
> other two nodes. The Ducc job processes are not able to pick up the data from
> the data source and no UIMA aggregator is working for those batches.
>
> Is the issue because of the Java heap space? We are giving 4gb ram to the
> job-process.
>
> Attaching the Log file.
>
> Thanks and Regards
> Priyank Sharma
>
> On Thursday 09 November 2017 04:33 PM, Lou DeGenaro wrote:
>
>> The first place to look is in your job's logs.  Visit the ducc-mon jobs
>> page ducchost:42133/jobs.jsp then click on the id of your job.  Examine
>> the
>> logs by clicking on each log file name looking for any revealing
>> information.
>>
>> Feel free to post non-confidential snippets here, or If you'd like to chat
>> in real time we can use hipchat.
>>
>> Lou.
>>
>> On Thu, Nov 9, 2017 at 5:19 AM, priyank sharma > >
>> wrote:
>>
>> All!
>>>
>>> I have a problem regarding the DUCC cluster in which a job process gets stuck
>>> and keeps on processing the same batch again and again; after the maximum
>>> duration the batch gets the reason or extraordinary status "CanceledByUser"
>>> and then gets restarted with the same IDs. This usually happens after
>>> 15
>>> to 20 days and goes away after restarting the ducc cluster. While going
>>> through the data store that is being used by CAS consumer to ingest data,
>>> the data regarding this batch never gets ingested. So most probably
>>> this data is not being processed.
>>>
>>> How to check if this data is being processed or not?
>>>
>>> Are the resources the issue and why it is being processed after
>>> restarting
>>> the cluster?
>>>
>>> We have three nodes cluster with  32gb ram, 40gb ram and 28 gb ram.
>>>
>>>
>>>
>>> --
>>> Thanks and Regards
>>> Priyank Sharma
>>>
>>>
>>>
>


Re: UIMA analysis from a database

2017-09-15 Thread Eddie Epstein
DUCC does have hooks to allow entire machines to be dynamically added or
removed from a running DUCC cluster. So in principle DUCC could be run as
an application under a different resource manager as long as resources were
available at the machine level. It should also be possible to run other
infrastructures under DUCC, for example where a Hadoop/Spark subcluster is
turned on and off as required.

One aspect of DUCC that is not cloud-friendly has been its dependency on a
shared filesystem. Work has been done recently to remove this requirement,
and the latest release can run without a shared FS, but some useful
functionality is then not available: specifically, facilitating the
distribution of user application code to worker machines, and automatically
retrieving user logfiles written to local disk for display in the DUCC web
console.

regards,
Eddie

On Fri, Sep 15, 2017 at 2:54 PM, Fox, David <david@optum.com> wrote:

> Another thanks to all contributing to this thread.
>
> We're looking to transition a large NLP application processing ~30TB/month
> from a custom NLP framework to UIMA-AS, and from parallel processing on a
> dedicated cluster with custom python scripts which call gnu parallel, to
> something with better support for managing resources on a shared cluster.
>
> Both our internal IT/engineering group and our cluster vendor
> (HortonWorks) use and support Hadoop/Spark/YARN on a new shared cluster.
> DUCC's capabilities seem to overlap with these more general purpose tools.
>  Although it may be more closely aligned with UIMA for a dedicated
> cluster, I think the big question for us would be how/whether it would
> play nicely with other Hadoop/Spark/YARN jobs on the shared cluster.
> We're also likely to move at least some of our workload to a cloud
> computing host, and it seems like Hadoop/Spark are much more likely to be
> supported there.
>
> David Fox
>
> On 9/15/17, 1:57 PM, "Eddie Epstein" <eaepst...@gmail.com> wrote:
>
> >There are a few DUCC features that might be of particular interest for
> >scaling out UIMA analytics.
> >
> > - all user code for batch processing continues to use the existing UIMA
> >component model: collection readers, cas multipliers, analysis engines, and
> >cas consumers.**
> >
> > - DUCC supports assembling and debugging a single threaded process with
> >these components, and then with no code change launch a highly scaled out
> >deployment.
> >
> > - for applications that use too much RAM to be able to utilize all the
> >cores on worker machines, DUCC can do the vertical (thread) scaleout
> >needed
> >to share memory.
> >
> > - DUCC automatically captures the performance breakdown of the UIMA-based
> >processes, as well as capturing process statistics including CPU, RAM,
> >swap, pagefaults and GC. Performance breakdown info for individual tasks
> >(DUCC work items) can optionally be captured.
> >
> > - DUCC has extensive error handling, automatically resubmitting work
> >associated with uncaught exceptions, process crashes, machine failures,
> >network failures, etc.
> >
> > - Exceptions are convenient to get to, and an attempt is made to make
> >obvious things that might be tricky to find, such as all the reasons a
> >process
> >might fail to start, without having to dig through DUCC framework logs.
> >
> >** DUCC services introduce a new user programmable component, a service
> >pinger, that is responsible for validating that a service is operating
> >correctly. The service pinger can also dynamically change the number of
> >instances of a service, and it can restart individual instances that are
> >determined to be acting badly.
> >
> >Eddie
> >
> >On Fri, Sep 15, 2017 at 10:32 AM, Osborne, John D <josbo...@uabmc.edu>
> >wrote:
> >
> >> Thanks Richard and Nicholas,
> >>
> >> Nicholas - have you looked at SUIM (https://github.com/oaqa/suim) ?
> >>
> >> It's also doing UIMA on Spark - I'm wondering if you are aware of it and
> >> how it is different from your own project?
> >>
> >> Thanks for any info,
> >>
> >>  -John
> >>
> >>
> >> 
> >> From: Richard Eckart de Castilho [r...@apache.org]
> >> Sent: Friday, September 15, 2017 5:29 AM
> >> To: user@uima.apache.org
> >> Subject: Re: UIMA analysis from a database
> >>
> >> On 15.09.2017, at 09:28, Nicolas Paris <nipari...@gmail.com> wrote:
> >> >
> >> > - UIMA-AS is another way to program UIMA
> >>
> >> Here you probably meant uimaFIT.
> >

Re: UIMA analysis from a database

2017-09-15 Thread Eddie Epstein
There are a few DUCC features that might be of particular interest for
scaling out UIMA analytics.

 - all user code for batch processing continues to use the existing UIMA
component model: collection readers, cas multipliers, analysis engines, and
cas consumers.**

 - DUCC supports assembling and debugging a single threaded process with
these components, and then with no code change launch a highly scaled out
deployment.

 - for applications that use too much RAM to be able to utilize all the
cores on worker machines, DUCC can do the vertical (thread) scaleout needed
to share memory.

 - DUCC automatically captures the performance breakdown of the UIMA-based
processes, as well as capturing process statistics including CPU, RAM,
swap, pagefaults and GC. Performance breakdown info for individual tasks
(DUCC work items) can optionally be captured.

 - DUCC has extensive error handling, automatically resubmitting work
associated with uncaught exceptions, process crashes, machine failures,
network failures, etc.

 - Exceptions are convenient to get to, and an attempt is made to make
obvious things that might be tricky to find, such as all the reasons a process
might fail to start, without having to dig through DUCC framework logs.

** DUCC services introduce a new user programmable component, a service
pinger, that is responsible for validating that a service is operating
correctly. The service pinger can also dynamically change the number of
instances of a service, and it can restart individual instances that are
determined to be acting badly.

Eddie

On Fri, Sep 15, 2017 at 10:32 AM, Osborne, John D 
wrote:

> Thanks Richard and Nicholas,
>
> Nicholas - have you looked at SUIM (https://github.com/oaqa/suim) ?
>
> It's also doing UIMA on Spark - I'm wondering if you are aware of it and
> how it is different from your own project?
>
> Thanks for any info,
>
>  -John
>
>
> 
> From: Richard Eckart de Castilho [r...@apache.org]
> Sent: Friday, September 15, 2017 5:29 AM
> To: user@uima.apache.org
> Subject: Re: UIMA analysis from a database
>
> On 15.09.2017, at 09:28, Nicolas Paris  wrote:
> >
> > - UIMA-AS is another way to program UIMA
>
> Here you probably meant uimaFIT.
>
> > - UIMA-FIT is complicated
> > - UIMA-FIT only work with UIMA
>
> ... and I suppose you mean UIMA-AS here.
>
> > - UIMA only focuses on text Annotation
>
> Yep. Although it has also been used for other media, e.g. video and audio.
> But the core UIMA framework doesn't specifically consider these media.
> People who apply it UIMA in the context of other media do so with custom
> type systems.
>
> > - UIMA is not good at:
> >   - text transformation
>
> It is not straight-forward but possible. E.g. the text normalizers in
> DKPro Core make use of either different views for different states of
> normalization or drop the original text and forward the normalized
> text within a pipeline by means of a CAS multiplier.
>
> >   - read data from source in parallel
> >   - write data to folder in parallel
>
> Not sure if these two are limitations of the framework
> rather than of the way that you use readers and writers
> in the particular scale-out mode you are working with.
>
> >   - machine learning interface
>
> UIMA doesn't offer ML as part of the core framework because
> that is simply not within the scope of what the UIMA framework
> aims to achieve.
>
> There are various people who have built ML around UIMA, e.g.
> ClearTK (https://urldefense.proofpoint.com/v2/url?u=http-
> 3A__cleartk.github.io_cleartk_=DwICAw=o3PTkfaYAd6-No7SurnLtwPssd47t-
> De9Do23lQNz7U=SEpLmXf_P21h_X0qEQSssKMDDEOsGxxYoSxofi_ZbFo=tAU9eh1Sq_D-
> L1P4GfuME4SQleRf9q_7Ll9siim5W0c=J1-BGfzlrX9t3-
> Vg5K7mAVBHQSb7M5PAbTYIJoh6sOM= ) or DKPro TC
> (https://urldefense.proofpoint.com/v2/url?u=https-
> 3A__dkpro.github.io_dkpro-2Dtc_=DwICAw=o3PTkfaYAd6-No7SurnLtwPssd47t-
> De9Do23lQNz7U=SEpLmXf_P21h_X0qEQSssKMDDEOsGxxYoSxofi_ZbFo=tAU9eh1Sq_D-
> L1P4GfuME4SQleRf9q_7Ll9siim5W0c=kye5D2izwKE_9V2QQW8leiKp0p-91U-
> CFwXJMFmCd3w= ) - and as you did, it
> can be combined in various ways with ML frameworks that
> specialize specifically on ML.
>
>
> Cheers,
>
> -- Richard
>
>
>


Re: DUCC job automatically fails and gives Reason,or extraordinary status as cancelled by User | DUCC Version: 2.0.1

2017-05-17 Thread Eddie Epstein
How long does the job run before stopping? "Cancelled by user" can occur if
the job was submitted with cancel_on_interrupt and the client that submitted
the job was stopped.

Eddie

On Tue, May 16, 2017 at 8:31 AM, Lou DeGenaro 
wrote:

> Dunno why the connection would be refused.  Are the JD and JP on the same
> or different machines?  Is the network viable between the machines on which
> each is located?
>
> Lou.
>
> On Tue, May 16, 2017 at 8:18 AM, priyank sharma  >
> wrote:
>
> > Hey!
> >
> > There were no error found in JD log.Following is a snippet of the jD log
> >
> > 14 May 2017 18:47:39,593  INFO ActionGet - T[482] engage  seqNo=3484
> > remote=S144.3170.35
> > 14 May 2017 18:47:39,641  INFO ActionGet - T[283] engage  seqNo=3485
> > remote=S144.2443.34
> > 14 May 2017 18:47:40,688  INFO ActionEnd - T[284] engage  seqNo=3470
> > remote=S144.2443.36 ended
> > in getNext
> > 14 May 2017 18:47:40,736  INFO ActionGet - T[483] engage  seqNo=3486
> > remote=S144.2443.36
> > 14 May 2017 18:47:43,207  INFO ActionEnd - T[482] engage  seqNo=3477
> > remote=S144.3346.32 ended
> > in getNext
> > 14 May 2017 18:47:43,254  INFO ActionGet - T[284] engage  seqNo=3487
> > remote=S144.3346.32
> > 14 May 2017 18:47:43,258  INFO ActionEnd - T[283] engage  seqNo=3467
> > remote=S144.2443.35 ended
> > in getNext
> > 14 May 2017 18:47:43,296  INFO ActionGet - T[483] engage  seqNo=3488
> > remote=S144.2443.35
> > 14 May 2017 18:47:44,425  INFO ActionEnd - T[283] engage  seqNo=3468
> > remote=S144.3346.34 ended
> > in getNext
> > 14 May 2017 18:47:44,605  INFO ActionGet - T[483] engage  seqNo=3489
> > remote=S144.3346.34
> > 14 May 2017 18:47:46,105  INFO ActionEnd - T[283] engage  seqNo=3480
> > remote=S144.3346.33 ended
> > in getNext
> > 14 May 2017 18:47:46,166  INFO ActionGet - T[482] engage  seqNo=3490
> > remote=S144.3346.33
> > 14 May 2017 18:47:46,233  INFO ActionEnd - T[284] engage  seqNo=3478
> > remote=S144.3346.36 ended
> > in getNext
> > 14 May 2017 18:47:46,415  INFO ActionGet - T[482] engage  seqNo=3491
> > remote=S144.3346.36
> > 14 May 2017 18:47:49,924  INFO ActionEnd - T[284] engage  seqNo=3475
> > remote=S144.3348.35 ended
> > in getNext
> > 14 May 2017 18:47:49,968  INFO ActionGet - T[482] engage  seqNo=3492
> > remote=S144.3348.35
> > 14 May 2017 18:47:50,856  INFO ActionEnd - T[283] engage  seqNo=3469
> > remote=S144.3348.32 ended
> > in getNext
> > 14 May 2017 18:47:50,918  INFO ActionGet - T[284] engage  seqNo=3493
> > remote=S144.3348.32
> > 14 May 2017 18:47:53,566  INFO ActionEnd - T[284] engage  seqNo=3459
> > remote=S144.2443.33 ended
> > in getNext
> > 14 May 2017 18:47:53,599  INFO ActionGet - T[483] engage  seqNo=3494
> > remote=S144.2443.33
> > 14 May 2017 18:47:58,507  INFO ActionEnd - T[283] engage  seqNo=3473
> > remote=S144.3348.36 ended
> > in getNext
> > 14 May 2017 18:47:58,565  INFO ActionGet - T[284] engage  seqNo=3495
> > remote=S144.3348.36
> > 14 May 2017 18:48:06,218  INFO ActionEnd - T[283] engage  seqNo=3460
> > remote=S144.3348.34 ended
> > in getNext
> > 14 May 2017 18:48:06,360  INFO ActionGet - T[483] engage  seqNo=3496
> > remote=S144.3348.34
> > 14 May 2017 18:48:09,619  INFO ActionEnd - T[283] engage  seqNo=3481
> > remote=S144.2443.32 ended
> > in getNext
> > 14 May 2017 18:48:09,674  INFO ActionEnd - T[483] engage  seqNo=3479
> > remote=S144.3170.36 ended
> > 14 May 2017 18:48:09,681  INFO ActionGet - T[284] engage  seqNo=3497
> > remote=S144.2443.32
> > in getNext
> > 14 May 2017 18:48:09,814  INFO ActionGet - T[482] engage  seqNo=3498
> > remote=S144.3170.36
> > 14 May 2017 18:48:13,464  INFO ActionEnd - T[283] engage  seqNo=3476
> > remote=S144.3346.35 ended
> > in getNext
> > 14 May 2017 18:48:13,498  INFO ActionGet - T[483] engage  seqNo=3499
> > remote=S144.3346.35
> > 14 May 2017 18:48:15,116  INFO ActionEnd - T[284] engage  seqNo=3482
> > remote=S144.3170.32 ended
> > in getNext
> > 14 May 2017 18:48:15,163  INFO ActionGet - T[283] engage  seqNo=3500
> > remote=S144.3170.32
> > 14 May 2017 18:48:17,050  INFO ActionEnd - T[284] engage  seqNo=3465
> > remote=S144.3170.33 ended
> > in getNext
> > 14 May 2017 18:48:17,141  INFO ActionGet - T[482] engage  seqNo=3501
> > remote=S144.3170.33
> > 14 May 2017 18:48:19,138  INFO ActionEnd - T[284] engage  seqNo=3471
> > remote=S144.3170.34 ended
> > 14 May 2017 18:48:19,148  INFO ActionEnd - T[283] engage  seqNo=3487
> > remote=S144.3346.32 ended
> > in getNext
> > in getNext
> > 14 May 2017 18:48:19,180  INFO ActionGet - T[483] engage  seqNo=3502
> > remote=S144.3170.34
> > 14 May 2017 18:48:19,262  INFO ActionGet - T[284] engage  seqNo=3503
> > remote=S144.3346.32
> > 14 May 2017 18:48:22,923  INFO ActionEnd - T[482] engage  seqNo=3486
> > remote=S144.2443.36 ended
> > in getNext
> > 14 May 2017 18:48:22,977  INFO ActionGet - T[284] engage  seqNo=3504
> > remote=S144.2443.36
> > 14 May 2017 18:48:32,013  INFO ActionEnd - T[284] engage  seqNo=3492
> > 

Re: Synchonizing Batches AE and StatusCallbackListener

2017-04-21 Thread Eddie Epstein
Hi Erik,

A few words about DUCC and your application. DUCC is a cluster controller
that includes a resource manager and 3 applications: batch processing, long
running services and singleton processes.

The batch processing application consists of a user's CollectionReader, which
defines work items, and a user's aggregate for processing work items, which can
be replicated as desired across the cluster of machines. DUCC manages the
remote process scale out and distribution of work items. The aggregate can
be vertically scaled within each process so that in-heap data can be shared
by multiple instances of the aggregate. UIMA-AS is not required for this
simple threading model.

For most applications a work item is itself a collection, a CAS containing
references to the data to be processed, where the collection size is
designed to have small enough granularity to support scale out but big
enough granularity to avoid bottlenecks.

The user's aggregate normally has an initial CasMultiplier that reads the
input data and creates the CASes to be fed to the rest of the pipeline.
When all child CASes have finished processing, the work item CAS is routed
to the aggregate's CasConsumer to finalize the collection. DUCC considers
the work item complete only when the work item CAS is successfully
processed.
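
A minimal sketch of such a work item CasMultiplier follows; purely as an
assumption for this example, the work item CAS carries one document per line
of its document text, whereas real jobs usually carry file or database
references:

import org.apache.uima.analysis_component.JCasMultiplier_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.AbstractCas;
import org.apache.uima.jcas.JCas;

public class WorkItemCasMultiplier extends JCasMultiplier_ImplBase {
  private String[] docs;   // documents of the current work item
  private int next;

  @Override
  public void process(JCas workItemCas) throws AnalysisEngineProcessException {
    docs = workItemCas.getDocumentText().split("\n");
    next = 0;
  }

  @Override
  public boolean hasNext() {
    return next < docs.length;
  }

  @Override
  public AbstractCas next() {
    JCas childCas = getEmptyJCas();
    childCas.setDocumentText(docs[next++]);
    return childCas;
  }
}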

The system is quite robust to errors: uncaught exceptions, analytics
crashing, machines crashing, etc.

Regards,
Eddie


On Fri, Apr 21, 2017 at 2:12 PM, Olga Patterson 
wrote:

> Erik,
>
> My team at the VA have developed an easy way of implementing UIMA AS
> pipelines and scaling them to a large number of nodes - using Leo framework
> that extends UIMA AS 2.8.1. We have run pipelines on over 200M documents
> scaled across multiple nodes with dozens of service instances and it
> performs great.
>
> Here is some info:
> http://department-of-veterans-affairs.github.io/Leo/
>
> The documentation for Leo reflects an earlier version of Leo, but if you
> are interested in using it with Java 8 and UIMA 2.8.1, we have not released
> the latest version in on the VA github yet but we can share it with you so
> that you can test it out and possibly provide your comments back to us.
>
> Leo has simple-to-use functionality for flexible batch read and write and
> it can work with any UIMA AEs and existing descriptor files and type system
> descriptions, so if you already have a pipeline, wrapping it with Leo would
> take just a few lines of code.
>
> Let me know if you are interested and I can help you to get started.
>
> Olga Patterson
>
>
>
>
>
>
>
> -Original Message-
> From: Jaroslaw Cwiklik 
> Reply-To: "user@uima.apache.org" 
> Date: Friday, April 21, 2017 at 8:08 AM
> To: "user@uima.apache.org" 
> Subject: Re: Synchonizing Batches AE and StatusCallbackListener
>
> Erik, thanks. This is more clear what you are trying to accomplish.
> First,
> there are no plans to retire the CPE. It is supported and I don't know
> of
> any plans to retire it. The only issue is ongoing development. My
> efforts
> are focused on extending and improving UIMA-AS.
>
> I don't have an answer yet how to handle the CPE crash scenario with
> respect to batching and subsequent restart from the last known good
> batch.
> Seems like some coordination would be needed to avoid redoing the whole
> collection after a crash. Its been awhile since I've looked at the CPE.
> Will take a look and see what is possible if anything.
>
> There is another Apache UIMA project called DUCC which stands for
> Distributed Uima Cluster Computing. From your email it looks like you
> have
> a cluster of machines available. Here is a quick description of DUCC:
>
> DUCC is a Linux cluster controller designed to scale out any UIMA
> pipeline
> for high throughput collection processing jobs as well as for low
> latency
> real-time applications. Building on UIMA-AS, DUCC is particularly well
> suited to run large memory Java analytics in multiple threads in order
> to
> fully utilize multicore machines. DUCC manages the life cycle of all
> processes deployed across the cluster, including non-UIMA processes
> such as
> tomcat servers or VNC sessions.
>
>  You can find more info on this here:
> https://uima.apache.org/doc-uimaducc-whatitam.html
>
> In UIMA-AS batching is an application concern. I am a bit fuzzy on
> implementation so perhaps someone else can comment how to implement
> batching and how to handle errors. You can use a CasMultipler and a
> custom
> FlowController to manage CASes and react to errors.The UIMA-AS service
> can
> take an input CAS representing your batch, pass it on to the
> CasMultiplier,
> generate CASes for each piece of work and deliver results to the
> CasConsumer with a FlowController in the middle orchestrating the
> flow. I
> defer to 

Re: Free instance of agreggate with cas multiplier in MultiprocessingAnalysisEngine

2016-11-09 Thread Eddie Epstein
Sounds like a bug in MultiprocessingAnalysisEngine_impl. Any chance you
could simplify your scenario and attach it to a Jira issue against UIMA?

On Wed, Nov 9, 2016 at 1:24 PM, nelson rivera <nelsonriver...@gmail.com>
wrote:

> Not, for only one instance, the behavior is correct, and generate all
> child cas required.
>
> 2016-11-09 9:40 GMT-05:00, Eddie Epstein <eaepst...@gmail.com>:
> > Is behavior the same for single-threaded AnalysisEngine instantiation?
> >
> > On Tue, Nov 8, 2016 at 10:00 AM, nelson rivera <nelsonriver...@gmail.com
> >
> > wrote:
> >
> >> I have an aggregate analysis engine that contains a casmultiplier
> >> annotator. I instantiate this aggregate with the interface
> >> UIMAFramework.produceAnalysisEngine(specifier, 1, 0) for multithreaded
> >> processing. The casmultiplier generate more than one cas for each
> >> input CAS. The issue is that after first cas child, that i get with
> >>
> >>  JCasIterator casIterator =
> >> analysisEngine.processAndOutputNewCASes(jcas);
> >> while (casIterator.hasNext()) {
> >>JCas outCas = casIterator.next();
> >>...
> >> outCas.release();
> >> }
> >>
> >> after this first cas child, the MultiprocessingAnalysisEngine_impl
> >> assumes that the instance of
> >> AggregateAnalysisEngine that processes the request has ended, Y
> >> entonces esta instancia es libre para procesar otra solicitud de otro
> >> hilo, and it is not true, because missing child cas, producing
> >> concurrency errors.
> >>
> >> What is the condition for an instance of MultiprocessingAnalysisEngine
> >> that contains a CAS multiplier generating many child CASes for each
> >> input CAS to be determined finished and free?
> >>
> >
>


Re: Free instance of agreggate with cas multiplier in MultiprocessingAnalysisEngine

2016-11-09 Thread Eddie Epstein
Is the behavior the same for a single-threaded AnalysisEngine instantiation?
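
That is, something roughly like this (a sketch only), letting the iterator
drain completely before the engine is reused:

import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.JCasIterator;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceSpecifier;

public class SingleThreadCheck {
  static void run(ResourceSpecifier specifier, String text) throws Exception {
    AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(specifier); // no pool
    JCas jcas = ae.newJCas();
    jcas.setDocumentText(text);
    JCasIterator it = ae.processAndOutputNewCASes(jcas);
    while (it.hasNext()) {
      JCas outCas = it.next();
      // ... verify the expected child CASes arrive ...
      outCas.release();
    }
  }
}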

On Tue, Nov 8, 2016 at 10:00 AM, nelson rivera 
wrote:

> I have an aggregate analysis engine that contains a casmultiplier
> annotator. I instantiate this aggregate with the interface
> UIMAFramework.produceAnalysisEngine(specifier, 1, 0) for multithreaded
> processing. The casmultiplier generate more than one cas for each
> input CAS. The issue is that after first cas child, that i get with
>
>  JCasIterator casIterator = analysisEngine.processAndOutputNewCASes(jcas);
> while (casIterator.hasNext()) {
>JCas outCas = casIterator.next();
>...
> outCas.release();
> }
>
> after this first cas child, the MultiprocessingAnalysisEngine_impl
> assumes that the instance of
> AggregateAnalysisEngine that processed the request has finished, and then
> this instance is free to process another request from another thread,
> which is not true, because child CASes are still pending, producing
> concurrency errors.
>
> What is the condition for an instance of MultiprocessingAnalysisEngine
> that contains a CAS multiplier generating many child CASes for each
> input CAS to be determined finished and free?
>


Re: java.lang.ClassCastException with binary SerializationStrategy

2016-11-03 Thread Eddie Epstein
Is a collection reader being plugged into the UimaAsynchronousEngine? If so,
does its component descriptor define or import any types? Sorry to say,
given that Xmi works, the most likely problem is still a type system mismatch.

Eddie

On Thu, Nov 3, 2016 at 5:37 PM, nelson rivera <nelsonriver...@gmail.com>
wrote:

> Yes, with XmiCas serialization everything works fine. The client and
> the input CAS have identical type system definitions, because I get
> the CAS from UimaAsynchronousEngine with the line
> "asynchronousEngine.getCAS()". Any idea of the problem?
>
> 2016-11-03 16:49 GMT-04:00, Eddie Epstein <eaepst...@gmail.com>:
> > Hi,
> >
> > Binary serialization for a service call only works if the client and
> > service have identical type system definitions. Have you confirmed
> > everything works with the default XmiCas serialization?
> >
> > Eddie
> >
> > On Thu, Nov 3, 2016 at 3:51 PM, nelson rivera <nelsonriver...@gmail.com>
> > wrote:
> >
> >> I want to consume a UIMA-AS aggregate service; the service has all
> >> delegates co-located, with binary serialization format. I set
> >> SerializationStrategy to "binary" on the client side in the
> >> application context map used to pass initialization parameters. But
> >> when processing I get this exception in the UIMA-AS service:
> >>
> >>
> >> 01:42:00.679 - 14:
> >> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.
> >> handleProcessRequestFromRemoteClient:
> >> WARNING:
> >> java.lang.ClassCastException: org.apache.uima.cas.impl.AnnotationImpl
> >> cannot be cast to org.apache.uima.cas.SofaFS
> >> at org.apache.uima.cas.impl.CASImpl.getSofa(CASImpl.java:834)
> >> at
> >> org.apache.uima.cas.impl.FSIndexRepositoryImpl.ll_addFS_common(
> >> FSIndexRepositoryImpl.java:2786)
> >> at org.apache.uima.cas.impl.FSIndexRepositoryImpl.ll_
> >> addFS(FSIndexRepositoryImpl.java:2763)
> >> at org.apache.uima.cas.impl.FSIndexRepositoryImpl.addFS(
> >> FSIndexRepositoryImpl.java:2068)
> >> at org.apache.uima.cas.impl.CASImpl.reinitIndexedFSs(
> >> CASImpl.java:1916)
> >> at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1640)
> >> at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1393)
> >> at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1386)
> >> at org.apache.uima.cas.impl.Serialization.deserializeCAS(
> >> Serialization.java:187)
> >> at org.apache.uima.aae.UimaSerializer.deserializeCasFromBinary(
> >> UimaSerializer.java:223)
> >> at org.apache.uima.aae.handler.input.ProcessRequestHandler_
> impl.
> >> deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:229)
> >> at org.apache.uima.aae.handler.input.ProcessRequestHandler_
> impl.
> >> handleProcessRequestFromRemoteClient(ProcessRequestHandler_
> impl.java:531)
> >> at org.apache.uima.aae.handler.input.ProcessRequestHandler_
> >> impl.handle(ProcessRequestHandler_impl.java:1062)
> >> at org.apache.uima.aae.handler.input.MetadataRequestHandler_
> >> impl.handle(MetadataRequestHandler_impl.java:78)
> >> at org.apache.uima.adapter.jms.activemq.JmsInputChannel.
> >> onMessage(JmsInputChannel.java:731)
> >> at
> >> org.springframework.jms.listener.AbstractMessageListenerContain
> >> er.doInvokeListener(AbstractMessageListenerContainer.java:689)
> >> at
> >> org.springframework.jms.listener.AbstractMessageListenerContain
> >> er.invokeListener(AbstractMessageListenerContainer.java:649)
> >> at
> >> org.springframework.jms.listener.AbstractMessageListenerContain
> >> er.doExecuteListener(AbstractMessageListenerContainer.java:619)
> >> at
> >> org.springframework.jms.listener.AbstractPollingMessageListener
> >> Container.doReceiveAndExecute(AbstractPollingMessageListener
> >> Container.java:307)
> >> at
> >> org.springframework.jms.listener.AbstractPollingMessageListener
> >> Container.receiveAndExecute(AbstractPollingMessageListener
> >> Container.java:245)
> >> at
> >> org.springframework.jms.listener.DefaultMessageListenerContaine
> >> r$AsyncMessageListenerInvoker.invokeListener(
> >> DefaultMessageListenerContainer.java:1144)
> >> at
> >> org.springframework.jms.listener.DefaultMessageListenerContaine
> >> r$AsyncMessageListenerInvoker.executeOngoingLoop(
> >> DefaultMessageListenerContainer.java:1136)
> >> at
> >> org.springframework.jms.listener.DefaultMessageListenerContaine
> >> r$AsyncMessageListenerInvoker.run(DefaultMessageListenerContaine
> >> r.java:1033)
> >> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> >> ThreadPoolExecutor.java:1145)
> >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> >> ThreadPoolExecutor.java:615)
> >> at org.apache.uima.aae.UimaAsThreadFactory$1.run(
> >> UimaAsThreadFactory.java:132)
> >> at java.lang.Thread.run(Thread.java:745)
> >>
> >
>


Re: java.lang.ClassCastException with binary SerializationStrategy

2016-11-03 Thread Eddie Epstein
Hi,

Binary serialization for a service call only works if the client and
service have identical type system definitions. Have you confirmed
everything works with the default XmiCas serialization?

Eddie

On Thu, Nov 3, 2016 at 3:51 PM, nelson rivera 
wrote:

> I want to consume a UIMA-AS aggregate service; the service has all
> delegates co-located, with binary serialization format. I set
> SerializationStrategy to "binary" on the client side in the
> application context map used to pass initialization parameters. But
> when processing I get this exception in the UIMA-AS service:
>
>
> 01:42:00.679 - 14:
> org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.
> handleProcessRequestFromRemoteClient:
> WARNING:
> java.lang.ClassCastException: org.apache.uima.cas.impl.AnnotationImpl
> cannot be cast to org.apache.uima.cas.SofaFS
> at org.apache.uima.cas.impl.CASImpl.getSofa(CASImpl.java:834)
> at org.apache.uima.cas.impl.FSIndexRepositoryImpl.ll_addFS_common(
> FSIndexRepositoryImpl.java:2786)
> at org.apache.uima.cas.impl.FSIndexRepositoryImpl.ll_
> addFS(FSIndexRepositoryImpl.java:2763)
> at org.apache.uima.cas.impl.FSIndexRepositoryImpl.addFS(
> FSIndexRepositoryImpl.java:2068)
> at org.apache.uima.cas.impl.CASImpl.reinitIndexedFSs(
> CASImpl.java:1916)
> at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1640)
> at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1393)
> at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1386)
> at org.apache.uima.cas.impl.Serialization.deserializeCAS(
> Serialization.java:187)
> at org.apache.uima.aae.UimaSerializer.deserializeCasFromBinary(
> UimaSerializer.java:223)
> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.
> deserializeCASandRegisterWithCache(ProcessRequestHandler_impl.java:229)
> at org.apache.uima.aae.handler.input.ProcessRequestHandler_impl.
> handleProcessRequestFromRemoteClient(ProcessRequestHandler_impl.java:531)
> at org.apache.uima.aae.handler.input.ProcessRequestHandler_
> impl.handle(ProcessRequestHandler_impl.java:1062)
> at org.apache.uima.aae.handler.input.MetadataRequestHandler_
> impl.handle(MetadataRequestHandler_impl.java:78)
> at org.apache.uima.adapter.jms.activemq.JmsInputChannel.
> onMessage(JmsInputChannel.java:731)
> at org.springframework.jms.listener.AbstractMessageListenerContain
> er.doInvokeListener(AbstractMessageListenerContainer.java:689)
> at org.springframework.jms.listener.AbstractMessageListenerContain
> er.invokeListener(AbstractMessageListenerContainer.java:649)
> at org.springframework.jms.listener.AbstractMessageListenerContain
> er.doExecuteListener(AbstractMessageListenerContainer.java:619)
> at org.springframework.jms.listener.AbstractPollingMessageListener
> Container.doReceiveAndExecute(AbstractPollingMessageListener
> Container.java:307)
> at org.springframework.jms.listener.AbstractPollingMessageListener
> Container.receiveAndExecute(AbstractPollingMessageListener
> Container.java:245)
> at org.springframework.jms.listener.DefaultMessageListenerContaine
> r$AsyncMessageListenerInvoker.invokeListener(
> DefaultMessageListenerContainer.java:1144)
> at org.springframework.jms.listener.DefaultMessageListenerContaine
> r$AsyncMessageListenerInvoker.executeOngoingLoop(
> DefaultMessageListenerContainer.java:1136)
> at org.springframework.jms.listener.DefaultMessageListenerContaine
> r$AsyncMessageListenerInvoker.run(DefaultMessageListenerContaine
> r.java:1033)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:615)
> at org.apache.uima.aae.UimaAsThreadFactory$1.run(
> UimaAsThreadFactory.java:132)
> at java.lang.Thread.run(Thread.java:745)
>


Re: UIMA DUCC limit max memory of node

2016-11-01 Thread Eddie Epstein
Hi,

You are right that ducc.agent.node.metrics.fake.memory.size will override
the agent's computation of total usable memory. This must be set as a Java
property on the agent. To set this for all agents, add the following line
to site.ducc.properties in the resources folder and restart DUCC:
ducc.agent.jvm.args = -Xmx500M -Dducc.agent.node.metrics.fake.memory.size=N
where N is in KB.

DUCC uses cgset -r cpu.shares=M to control a container's CPU share. M is
computed as
   container-size-in-bytes / total-memory-size-in-KB
so the maximum value for M in a DUCC container is 1024.
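
For example, by that formula a 50GB container on a machine with 100GB of
usable memory gets cpu.shares = (50 * 2^30 bytes) / (100 * 2^20 KB) = 512,
i.e. half of the maximum.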

cpu.shares controls CPU usage in a relative way. A container with
cpu.shares=1024 will potentially get 2x the CPU of a container with 512
shares. Note that if a container is using less than its share, other
containers will be allowed to get more than their share.

For newer OS, e.g. RHEL7, processes not put into a specific container are
put into the default container with cpu.shares = 1024. So if you break a
machine in half using fake.memory, and if DUCC were to fill its half up
with work, then the two halves of the box would have equal shares. Sounds
good for your scenario.

However, note that cpu.shares works for CPUs, not cores, so things may not
be so nice if hyperthreading is enabled. For example, consider a machine
with 32 cores and 2-way hyperthreading. A process burning 32 CPUs may
pretty much "max out" the machine even though there are 32 unused CPUs
available. To limit the DUCC half of a machine to only half the real
machine resources would require changing agent code to use the "Ceiling
Enforcement Tunable Parameters", which are absolute.

Eddie


On Tue, Nov 1, 2016 at 9:31 AM, Daniel Baumartz <
bauma...@stud.uni-frankfurt.de> wrote:

> Hi Eddie,
>
> ok, I will try to explain with more detail, maybe this is not how ducc is
> being used normally. We want to set up some nodes which are not exclusively
> used for ducc. For example, one of the nodes may have 100 GB, but we want
> the usable memory for ducc to only be 50 GB, not all free memory. (We also
> want to limit the CPU usage, for example only use 32 of 64 cores, but we
> have not tried to set this up yet.)
>
> We could not find any setting to achieve this behavior, so we tried using
> cgroups to limit the max usable memory for ducc. This did not work because
> ducc gets its memory info from /proc/meminfo which ignores the cgroups
> settings. After reading through the code it seems only setting
> "ducc.agent.node.metrics.fake.memory.size" (not setting up test mode) is
> doing something similar to what we want: "Comment from
> NodeMemInfoCollector.java: if running ducc in simulation mode skip memory
> adjustment. Report free memory = fakeMemorySize". But I am not sure if we
> can use this safely since it is for testing.
>
> So we basically want to give ducc an upper limit of usable memory.
>
> I hope it is a bit more clear what we want to achieve.
>
> Thanks again,
> Daniel
>
>
> Quoting Eddie Epstein <eaepst...@gmail.com>:
>
>
> Hi Daniel,
>>
>> For each node Ducc sums RSS for all "system" user processes and excludes
>> that from Ducc usable memory on the node. System users are defined by a
>> ducc.properties setting with default value:
>> ducc.agent.node.metrics.sys.gid.max = 500
>>
>> Ducc's simulation mode is intended for creating a scaled out cluster of
>> fake nodes for testing purposes.
>>
>> The only mechanism available for reserving additional memory is to have
>> Ducc run some dummy process that stays up forever. This could be a Ducc
>> service that is automatically started when Ducc starts. This could get
>> complicated for a heterogeneous set of machines and/or Ducc classes.
>>
>> Can you be more precise about what features you are looking for to limit
>> resource use on Ducc machines?
>>
>> Thanks,
>> Eddie
>>
>>
>> On Mon, Oct 31, 2016 at 10:03 AM, Daniel Baumartz <
>> bauma...@stud.uni-frankfurt.de> wrote:
>>
>> Hi,
>>>
>>> I am trying to set up nodes for Ducc that should not use all the memory
>>> on
>>> the machine. I tried to limit the memory with cgroups, but it seems Ducc
>>> is
>>> getting the memory info from /proc/meminfo which ignores the cgroups
>>> settings.
>>>
>>> Did I miss an option to specify the max usable memory? Could I safely use
>>> "ducc.agent.node.metrics.fake.memory.size" from the simulation settings?
>>> Or is there a better way to do this?
>>>
>>> Thanks,
>>> Daniel
>>>
>>>
>>>
>
>
>


Re: UIMA DUCC limit max memory of node

2016-10-31 Thread Eddie Epstein
Hi Daniel,

For each node Ducc sums RSS for all "system" user processes and excludes
that from Ducc usable memory on the node. System users are defined by a
ducc.properties setting with default value:
ducc.agent.node.metrics.sys.gid.max = 500

Ducc's simulation mode is intended for creating a scaled out cluster of
fake nodes for testing purposes.

The only mechanism available for reserving additional memory is to have
Ducc run some dummy process that stays up forever. This could be a Ducc
service that is automatically started when Ducc starts. This could get
complicated for a heterogeneous set of machines and/or Ducc classes.

Can you be more precise about what features you are looking for to limit
resource use on Ducc machines?

Thanks,
Eddie


On Mon, Oct 31, 2016 at 10:03 AM, Daniel Baumartz <
bauma...@stud.uni-frankfurt.de> wrote:

> Hi,
>
> I am trying to set up nodes for Ducc that should not use all the memory on
> the machine. I tried to limit the memory with cgroups, but it seems Ducc is
> getting the memory info from /proc/meminfo which ignores the cgroups
> settings.
>
> Did I miss an option to specify the max usable memory? Could I safely use
> "ducc.agent.node.metrics.fake.memory.size" from the simulation settings?
> Or is there a better way to do this?
>
> Thanks,
> Daniel
>
>


Re: Uima Ducc Service restart on timeout

2016-10-29 Thread Eddie Epstein
Hi Wahed,

One approach would be to configure the service itself to self-destruct if
processing exceeds a time threshold. UIMA-AS error configuration does
support timeouts for remote delegates, but not for in-process delegates. So
this would require starting a timer thread in the annotator that would call
exit() if not reset at the end of the process() method. DUCC will
automatically attempt to restart a service instance that exits.
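
Roughly like this (a sketch only; the 60 second limit and the exit code are
arbitrary):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;

public class WatchdogAnnotator extends JCasAnnotator_ImplBase {
  private final ScheduledExecutorService watchdog =
      Executors.newSingleThreadScheduledExecutor();

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    // arm the watchdog: if this CAS is still being processed after 60s, exit
    ScheduledFuture<?> bomb =
        watchdog.schedule(() -> System.exit(1), 60, TimeUnit.SECONDS);
    try {
      // ... the real (possibly looping) analysis goes here ...
    } finally {
      bomb.cancel(false);   // reset the timer at the end of process()
    }
  }
}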

DUCC's pinger API allows a service pinger to detect that a service instance
is not working correctly and tell DUCC to restart the instance. This
approach has been confirmed to work for the current trunk code, after post
v2.1.0 fixes.

Eddie


On Fri, Oct 28, 2016 at 10:07 AM, Wahed Hemati 
wrote:

> Hi,
>
> is there a mechanism in Ducc to restart a service, if it is processing a
> cas for to long?
>
> I have an annotator running as a primitive service on Ducc, which sometimes
> gets into an endless loop. I call this service with a Uima AS Client. I can
> set a timeout on the UIMA AS Client, this works. The client throws a
> timeout after the specified period, which is nice. However the Service is
> still processing the CAS somehow? Can I tell Ducc to shut down and restart a
> service if processing a CAS takes more than a specified time period
> (e.g. 60 sec)?
>
> Thanks in advance
>
> -Wahed
>
>
>


Re: C++/Python annotators in Eclipse on Mac OS

2016-05-06 Thread Eddie Epstein
Hi Sean,

There are example .mak files for compiling and creating shared libraries
from C++ annotator code. A couple of env parameters need to be set for the
build. It should be straightforward to configure eclipse CDT to build an
annotator and a C++ application calling annotators from a makefile.

Python annotators sit on top of a C++ annotator. No idea about eclipse
support for that kind of thing.

Eddie


On Thu, May 5, 2016 at 3:44 PM, Sean Crist  wrote:

>
> Hello,
>
> I’m trying to set up the ability to write annotators in C++ and in Python
> using Eclipse on Mac OS X.
>
> I read the following two sources:
>
> https://uima.apache.org/doc-uimacpp-huh.html
>
> Also the README file in the download of UIMACPP
>
> Both documents seem geared for using UIMA from the command line in Windows
> or Linux.  It wasn’t immediately evident how to translate those
> instructions to my situation.  There were a few passing mentions of Eclipse
> or Mac OS, but nothing like a step-by-step.
>
> Is there a writeup on this that I’ve missed in my Google search?  Absent
> that, any pointers or suggestions on how to proceed?
>
> Thanks,
> —Sean Crist
>
>
>


Re: UIMACPP and multi-threading

2016-04-28 Thread Eddie Epstein
Benjamin,

Initial testing with the latest AMQ broker indicates an incompatibility
with the existing UIMACPP release. Along with the problems you have exposed,
there is good motivation to get another uimacpp release out relatively
soon. Thanks for exposing the GC/threading issue with the JNI and potential
fixes.

Eddie

On Tue, Apr 26, 2016 at 3:47 AM, Benjamin De Boe <
benjamin.de...@intersystems.com> wrote:

> Hi Eddie,
>
> I'm not familiar with the serializeJNI issue.
> Few sources still recommend implementing finalize(), because it is
> undetermined in which order the GC process will eventually invoke them. We
> also thought it was counterintuitive to see the UimacppEngine being
> finalized before the UimacppAnalysisComponent that wraps it, but that's
> what our extra logs quite consistently seemed to indicate, so that's
> probably just what the word "non-deterministic" means.
>
> This article suggests a few alternatives that may be considered for this
> UIMACPP / JNI issue in the long run:
> http://www.oracle.com/technetwork/java/javamail/finalization-137655.html
>
>
> Thanks,
> benjamin
>
> --
> Benjamin De Boe | Product Manager
> M: +32 495 19 19 27 | T: +32 2 464 97 33
> InterSystems Corporation | http://www.intersystems.com
>
> -Original Message-
> From: Eddie Epstein [mailto:eaepst...@gmail.com]
> Sent: Tuesday, April 26, 2016 4:58 AM
> To: user@uima.apache.org
> Cc: Jos Denys <jos.de...@intersystems.com>; Chen-Chieh Hsu <
> chen-chieh@intersystems.com>
> Subject: Re: UIMACPP and multi-threading
>
> Hi,
>
> Not the author of the JNI, but does it make sense that
> UimacppEngine.finalize() could be called while UimacppAnalysisComponent
> maintains a valid engine pointer to UimacppEngine? And once the engine
> pointer has been set to null, UimacppAnalysisComponent.destroy() will not
> call UimacppEngine.destroy(). Leaves me confused how this could happen.
>
> At any rate, do you think finalize is related to the serializeJNI problem?
>
> Eddie
>
>
>
>
>
> On Mon, Apr 25, 2016 at 8:27 AM, Benjamin De Boe <
> benjamin.de...@intersystems.com> wrote:
>
> > After some more debugging, it seems this is probably a garbage
> > collection issue rather than a multi-threading issue, although
> > multiple threads may well increase the likelihood of it happening.
> >
> > We've found that there are two methods on the CPP side for cleaning up
> > the memory used by the CPP engine: destroyJNI() and destructorJNI().
> > destructorJNI() is called from the UimacppEngine:finalize() method and
> > only deletes the pInstance pointer, whereas destroyJNI() does a lot
> > more work in cleaning up what lies beyond and is called through
> > UimacppEngine:destroy(), which in turn is invoked from
> UimacppAnalysisComponent:finalize().
> >
> > Now, the arcane magic in the GC process seems to first finish off the
> > UimacppEngine helper object (calling destructorJNI()) and then the
> > UimacppAnalysisComponent instance that contained the other one, with
> > its
> > destroyJNI() method then running into trouble because pInstance was
> > already deleted in destructorJNI(), causing the access violation we've
> > been struggling with.
> >
> > [logged as https://issues.apache.org/jira/browse/UIMA-4899 ]
> >
> > There are a number of ways how we could work around this (such as just
> > calling destroyJNI() in both cases, exiting early if it's already
> > cleaned up), but of course we'd hope someone of the original UIMACPP
> > team to weigh in and share the reasoning behind those two separate
> > methods and anything we're overlooking in our assessment. Anybody who
> > can recommend what we should do in the short run and how this might
> > translate into a fixed UIMA / UIMACPP release at some point? An
> > out-of-the-box 64-bit UIMACPP release would probably benefit more than
> > just us (cf https://issues.apache.org/jira/browse/UIMA-4900).
> >
> >
> >
> > Thanks,
> > benjamin
> >
> > --
> > Benjamin De Boe | Product Manager
> > M: +32 495 19 19 27 | T: +32 2 464 97 33 InterSystems Corporation |
> > http://www.intersystems.com
> >
> > -Original Message-
> > From: Eddie Epstein [mailto:eaepst...@gmail.com]
> > Sent: Thursday, April 7, 2016 1:58 PM
> > To: user@uima.apache.org
> > Subject: Re: UIMACPP and multi-threading
> >
> > Standalone.java certainly does show threading issues with uimacpp's JNI.
> > The multithread testing thru the JNI, like the one I did a few days
> > ago, was clearly not sufficient to declare it thread safe.
>

Re: UIMACPP and multi-threading

2016-04-25 Thread Eddie Epstein
Hi,

Not the author of the JNI, but does it make sense that
UimacppEngine.finalize() could be called while UimacppAnalysisComponent
maintains a valid engine pointer to UimacppEngine? And once the engine
pointer has been set to null, UimacppAnalysisComponent.destroy() will not
call UimacppEngine.destroy(). Leaves me confused how this could happen.

At any rate, do you think finalize is related to the serializeJNI problem?

Eddie





On Mon, Apr 25, 2016 at 8:27 AM, Benjamin De Boe <
benjamin.de...@intersystems.com> wrote:

> After some more debugging, it seems this is probably a garbage collection
> issue rather than a multi-threading issue, although multiple threads may
> well increase the likelihood of it happening.
>
> We've found that there are two methods on the CPP side for cleaning up the
> memory used by the CPP engine: destroyJNI() and destructorJNI().
> destructorJNI() is called from the UimacppEngine:finalize() method and only
> deletes the pInstance pointer, whereas destroyJNI() does a lot more work in
> cleaning up what lies beyond and is called through UimacppEngine:destroy(),
> which in turn is invoked from UimacppAnalysisComponent:finalize().
>
> Now, the arcane magic in the GC process seems to first finish off the
> UimacppEngine helper object (calling destructorJNI()) and then the
> UimacppAnalysisComponent instance that contained the other one, with its
> destroyJNI() method then running into trouble because pInstance was already
> deleted in destructorJNI(), causing the access violation we've been
> struggling with.
>
> [logged as https://issues.apache.org/jira/browse/UIMA-4899 ]
>
> There are a number of ways how we could work around this (such as just
> calling destroyJNI() in both cases, exiting early if it's already cleaned
> up), but of course we'd hope someone of the original UIMACPP team to weigh
> in and share the reasoning behind those two separate methods and anything
> we're overlooking in our assessment. Anybody who can recommend what we
> should do in the short run and how this might translate into a fixed UIMA /
> UIMACPP release at some point? An out-of-the-box 64-bit UIMACPP release
> would probably benefit more than just us (cf
> https://issues.apache.org/jira/browse/UIMA-4900).
>
>
>
> Thanks,
> benjamin
>
> --
> Benjamin De Boe | Product Manager
> M: +32 495 19 19 27 | T: +32 2 464 97 33
> InterSystems Corporation | http://www.intersystems.com
>
> -Original Message-
> From: Eddie Epstein [mailto:eaepst...@gmail.com]
> Sent: Thursday, April 7, 2016 1:58 PM
> To: user@uima.apache.org
> Subject: Re: UIMACPP and multi-threading
>
> Standalone.java certainly does show threading issues with uimacpp's JNI.
> The multithread testing thru the JNI, like the one I did a few days ago,
> was clearly not sufficient to declare it thread safe.
>
> Our local uimacpp development with regards thread safety was focused on
> multithread testing for the development of uimacpp's native AMQ service
> wrapper.
>
> If you do fix the JNI threading issues please consider contributing them
> back to ASF!
> Eddie
>
> On Tue, Apr 5, 2016 at 8:54 AM, Jos Denys <jos.de...@intersystems.com>
> wrote:
>
> > Hi Eddie,
> >
> > I worked on the CPP-side, and what I noticed was that the JNI
> > Interface always passes an instance pointer :
> >
> > JNIEXPORT void JNICALL JAVA_PREFIX(resetJNI) (JNIEnv* jeEnv, jobject
> > joJTaf) {
> >   try {
> > UIMA_TPRINT("entering resetDocument()");
> >
> > uima::JNIInstance* pInstance = JNIUtils::getCppInstance(jeEnv,
> > joJTaf);
> >
> >
> > Now the strange thing, and finally what caused the access violation
> > error, was that the pInstance pointer was the same for the 3 threads
> > that
> > (simultaneously) did the UIMA processing, so it looks like the same
> > CAS was passed for 3 different analysis worker threads.
> >
> > Any idea why and how this can happen ?
> >
> > Thanks for your feedback,
> > Jos Denys,
> > InterSystems Benelux.
> >
> >
> > De : Benjamin De Boe
> > Envoyé : mardi 5 avril 2016 09:33
> > À : user@uima.apache.org
> > Cc : Jos Denys <jos.de...@intersystems.com>; Chen-Chieh Hsu <
> > chen-chieh@intersystems.com> Objet : RE: UIMACPP and
> > multi-threading
> >
> >
> > Hi Eddie,
> >
> >
> >
> > Thanks for your prompt response.
> >
> > In our experiment, we have one initial thread instantiating a CasPool
> > and then passing it on to newly spawned threads that each have their
> > own DaveDetector instance and fetch a new CAS from the shared pool.
> > T

Re: UIMACPP and multi-threading

2016-04-04 Thread Eddie Epstein
Hi Benjamin,

UIMACPP is thread safe, as is the JNI interface. To confirm, I just created
a UIMA-AS service with 10 instances of DaveDetector, and fed the service
800 CASes with up to 10 concurrent CASes at any time.

It is not the case with DaveDetector, but at annotator initialization some
analytics will store info in thread local storage, and expect the same
thread to be used to call the annotator process method. UIMA-AS and DUCC
guarantee that an instantiated AE is always called on the same thread.

Eddie



On Mon, Apr 4, 2016 at 10:56 AM, Benjamin De Boe <
benjamin.de...@intersystems.com> wrote:

> Hi,
>
> We're working with a UIMACPP annotator (wrapping our existing NLP library)
> and are running in what appears to be thread safety issues, which we can
> reproduce with the DaveDetector demo AE.
> When separate threads are accessing separate instances of the
> org.apache.uima.uimacpp.UimacppAnalysisComponent wrapper class on the Java
> side, it appears they are invoking the same object on the C++ side, which
> results in quite a mess (access violations and process crashes) when
> different threads concurrently invoke resetJNI() and fillCASJNI() on the
> org.apache.uima.uimacpp.UimacppAnalysisComponent object. When using a small
> CAS pool on the Java side, the problem does not seem to occur, but it
> resurfaces if the CAS pool grows bigger and memory settings are not
> increased accordingly. However, if this were a pure memory issue, we had
> hoped to see more telling errors and just guessing how big memory should be
> for larger deployments isn't very appealing an option either.
> Adding the synchronized keyword to the relevant method of the wrapper
> class on the Java side also avoids the issue, at the obvious cost of
> performance. Moving to UIMA-AS is not an option for us, currently.
>
> Given that the documentation is not explicit about it, we're hoping to get
> an unambiguous answer from this list: is UIMACPP actually supposed to be
> thread-safe? We saw old and resolved JIRA's that addressed thread-safety
> issues for UIMACPP, so we assumed it was the case, but reality seems to
> point in the opposite direction.
>
>
> Thanks in advance for your feedback,
>
> benjamin
>
>
> --
> Benjamin De Boe | Product Manager
> M: +32 495 19 19 27 | T: +32 2 464 97 33
> InterSystems Corporation | http://www.intersystems.com
>
>


Re: DUCC: Unable to do "Fixed" type of Reservation

2016-03-31 Thread Eddie Epstein
Hi Reshu,

Reserve type allows users to allocate an unconstrained resource. Because
reserve allocations are not constrained by cgroup containers, in v2.x these
allocations were restricted to be an entire machine.

Fixed type allocations, which are always associated with a specific user
process, have CPU and memory constrained by cgroups, if cgroups are enabled
and properly configured. If DUCC does not recognize cgroup support for a
node it falls back to monitoring memory use and killing processes that
exceed the specified threshold above requested allocation size. CGroup
status for each node is shown on the System->Machines page.

Ubuntu locates the cgroup folder differently from Red Hat and Suse OS. DUCC
v2 does have a property to specify this location, but you have found
another bug, this time hopefully only in the documentation.

The default value for this property is:
ducc.agent.launcher.cgroups.basedir=/cgroup/ducc
To override, put a different entry in
{ducc_runtime}/resource/site.ducc.properties
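
For example (the mount point below is only an illustration; point it at
wherever your distribution actually mounts the ducc cgroup):

  # {ducc_runtime}/resource/site.ducc.properties
  ducc.agent.launcher.cgroups.basedir=/sys/fs/cgroup/ducc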

Regards,
Eddie


On Thu, Mar 31, 2016 at 5:48 AM, reshu.agarwal 
wrote:

> Hi,
>
> In DUCC 1.x, we are able to do fixed reservation of some of the memory of
> Nodes but We are restricted to do "reserve" type of reservation in DUCC
> 2.x. I want to know the reason for the same.
>
> I am using ubuntu for DUCC installation and not be able to configure
> c-groups in it, So, I have tried to manage RAM utilization through FIXED
> reservation in DUCC 1.x. But, Now I have no option.
>
> Hope, you can solve my problem.
>
> Cheers.
>
> Reshu.
>


Re: DUCC 2.0.1 : JP Http Client Unable to Communicate with JD

2016-01-12 Thread Eddie Epstein
Hi Reshu,

This is caused by the CollectionReader running in the JobDriver putting
character data in the work item CAS that cannot be XML serialized. DUCC
needs to do better in making this problem clear.

Two choices to fix this: 1) have the CR screen for illegal characters and
not put them in the work item CAS, or 2) assuming that the illegal
characters do not cause problems for the analytics, use the standard DUCC
job model whereby the JobDriver sends references to the raw data and
CasMultipliers in the scaled out JobProcesses create the CASes to be
processed.
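
For choice 1, a minimal, framework-agnostic sketch of such screening (how the
screened text is placed into the work item CAS depends on your CR):

  // Drop characters that are not legal in XML 1.0; supplementary characters
  // (surrogate pairs) are also dropped here for simplicity.
  private static String stripInvalidXmlChars(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      boolean valid = c == 0x9 || c == 0xA || c == 0xD
          || (c >= 0x20 && c <= 0xD7FF)
          || (c >= 0xE000 && c <= 0xFFFD);
      if (valid) {
        sb.append(c);
      }
    }
    return sb.toString();
  }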

Regards,
Eddie

On Mon, Jan 11, 2016 at 11:36 PM, reshu.agarwal 
wrote:

>
> Hi,
>
> I was getting this error after 17 out of 200 documents were processed. I
> am unable to find any reason for the same. Please see the error below:
>
> INFO: Asynchronous Client Has Been Initialized. Serialization Strategy:
> [SerializationStrategy] Ready To Process.
> DuccAbstractProcessContainer.deploy()  User Container deployed
>  Deployed Processing Container - Initialization Successful - Thread 32
> DuccAbstractProcessContainer.deploy() > Deploying User Container
> ... UimaProcessContainer.doDeploy()
> 11 Jan 2016 17:18:36,969  INFO AgentSession - T[29] notifyAgentWithStatus
> ... Job Process State Changed - PID:24790. Process State: Initializing. JMX
> Url:N/A Dispatched State Update Event to Agent with IP:192.168.10.126
> DuccAbstractProcessContainer.deploy()  User Container deployed
>  Deployed Processing Container - Initialization Successful - Thread 34
> DuccAbstractProcessContainer.deploy() > Deploying User Container
> ... UimaProcessContainer.doDeploy()
> DuccAbstractProcessContainer.deploy()  User Container deployed
>  Deployed Processing Container - Initialization Successful - Thread 33
> 11 Jan 2016 17:18:38,277  INFO JobProcessComponent - T[33] setState
> Notifying Agent New State:Running
> 11 Jan 2016 17:18:38,279  INFO AgentSession - T[1] notifyAgentWithStatus
> ... Job Process State Changed - PID:24790. Process State: Running. JMX
> Url:service:jmx:rmi:///jndi/rmi://user:2106/jmxrmi Dispatched State Update
> Event to Agent with IP:192.168.10.126
> 11 Jan 2016 17:18:38,281  INFO AgentSession - T[33] notifyAgentWithStatus
> ... Job Process State Changed - PID:24790. Process State: Running. JMX
> Url:service:jmx:rmi:///jndi/rmi://user:2106/jmxrmi Dispatched State Update
> Event to Agent with IP:192.168.10.126
> 11 Jan 2016 17:18:38,281  INFO HttpWorkerThread - T[33]
> HttpWorkerThread.run()  Begin Processing Work Items - Thread Id:33
> 11 Jan 2016 17:18:38,285  INFO HttpWorkerThread - T[34]
> HttpWorkerThread.run()  Begin Processing Work Items - Thread Id:34
> 11 Jan 2016 17:18:38,285  INFO HttpWorkerThread - T[32]
> HttpWorkerThread.run()  Begin Processing Work Items - Thread Id:32
> 11 Jan 2016 17:18:38,458  INFO HttpWorkerThread - T[34] run  Thread:34
> Recv'd WI:19
> 11 Jan 2016 17:18:38,468  INFO HttpWorkerThread - T[32] run  Thread:32
> Recv'd WI:18
> 11 Jan 2016 17:18:38,478  INFO HttpWorkerThread - T[33] run  Thread:33
> Recv'd WI:21
> 11 Jan 2016 17:18:38,515 ERROR DuccHttpClient - T[33] execute  Unable to
> Communicate with JD - Error:HTTP/1.1 500  : The element type
> "org.apache.uima.ducc.container.net.impl.MetaCasTransaction" must be
> terminated by the matching end-tag
> "</org.apache.uima.ducc.container.net.impl.MetaCasTransaction>".
> 11 Jan 2016 17:18:38,515 ERROR DuccHttpClient - T[33] execute  Content
> causing error:[B@3c0873f9
> Thread::33 ERRR::Content causing error:[B@3c0873f9
> 11 Jan 2016 17:18:38,516 ERROR DuccHttpClient - T[33] run
> java.lang.RuntimeException: JP Http Client Unable to Communicate with JD -
> Error:HTTP/1.1 500  : The element type
> "org.apache.uima.ducc.container.net.impl.MetaCasTransaction" must be
> terminated by the matching end-tag
> "</org.apache.uima.ducc.container.net.impl.MetaCasTransaction>".
> at org.apache.uima.ducc.transport.configuration.jp
> .DuccHttpClient.execute(DuccHttpClient.java:226)
> at org.apache.uima.ducc.transport.configuration.jp
> .HttpWorkerThread.run(HttpWorkerThread.java:178)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at org.apache.uima.ducc.transport.configuration.jp
> .UimaServiceThreadFactory$1.run(UimaServiceThreadFactory.java:85)
> at java.lang.Thread.run(Thread.java:745)
> 11 Jan 2016 17:18:38,535 ERROR DuccHttpClient - T[33] execute  Unable to
> Communicate with JD - Error:HTTP/1.1 501 Method n>POST is not defined in
> RFC 2068 and is not supported by the Servlet API
> 11 Jan 2016 17:18:38,535 ERROR DuccHttpClient - T[33] execute  Content
> causing error:[B@12e81893
> Thread::33 ERRR::Content causing error:[B@12e81893
> 11 Jan 2016 17:18:38,535 ERROR DuccHttpClient - T[33] 

Re: DUCC 1.1.0- Remain in Completing state.

2016-01-05 Thread Eddie Epstein
Hi Reshu,

Each DUCC machine has an agent responsible for starting and killing
processes.
There was a bug ( https://issues.apache.org/jira/browse/UIMA-4194 ) where
the
agent failed to issue a kill -9 against "hung" JPs when a job was stopping.
The fix is in v2.0.

Regards,
Eddie


On Tue, Jan 5, 2016 at 12:54 AM, reshu.agarwal 
wrote:

> I forget to mention one thing i.e. After Killing the job, next job is
> unable to initialize and remain in " WaitingForDriver" state. I have also
> checked sm.log,or.log,pm.log e.t.c. but failed to find any thing. I have to
> restart my DUCC for running job again.
>
> Reshu.
>
>
> On 01/05/2016 11:14 AM, reshu.agarwal wrote:
>
>> Hi,
>>
>> I am using DUCC 1.1.0 version. I am facing an issue with my job, i.e. it
>> remains in completing state even after initializing the stop process. My
>> job used two processes. And Job Driver logs:
>>
>> Jan 04, 2016 12:43:13 PM
>> org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl
>> stop
>> INFO: Stopping Asynchronous Client.
>> Jan 04, 2016 12:43:13 PM
>> org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl
>> stop
>> INFO: Asynchronous Client Has Stopped.
>> Jan 04, 2016 12:43:13 PM
>> org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngineCommon_impl$SharedConnection
>> destroy
>> INFO: UIMA AS Client Shared Connection Has Been Closed
>> Jan 04, 2016 12:43:13 PM
>> org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngine_impl stop
>> INFO: UIMA AS Client Undeployed All Containers
>>
>> One process logs:
>>
>> Jan 04, 2016 12:44:50 PM
>> org.apache.uima.adapter.jms.activemq.JmsInputChannel stopChannel
>> INFO: Stopping Service JMS Transport. Service: ducc.jd.queue.87494
>> ShutdownNow false
>> Jan 04, 2016 12:44:50 PM
>> org.apache.uima.adapter.jms.activemq.JmsInputChannel stopChannel
>> INFO: Controller: ducc.jd.queue.87494 Stopped Listener on Endpoint:
>> queue://ducc.jd.queue.87494 Selector: Selector:Command=2000 OR Command=2002.
>>
>> But, other process do not have any log of stopping the process.
>>
>> The case is of not completely undeploying all processes. I have to use
>> command to cancel the process: /ducc_install/bin$ ./ducc_cancel --id 87494
>> --dpid 4529.
>>
>> Some times it cancelled the process otherwise I have to use "kill -9"
>> command to kill the job forcefully.
>>
>> Kindly help.
>>
>> Thanks in advanced.
>>
>> Reshu.
>>
>>
>>
>


Re: UIMA-DUCC installation with multiple machines

2015-11-30 Thread Eddie Epstein
Hi,

Did you confirm that user ducc@ducc-head can do passwordless ssh to
ducc-node-1?  If so, running ./check_ducc from the admin folder should give
some useful feedback about ducc-node-1.

Eddie

On Mon, Nov 30, 2015 at 5:14 AM, Sylvain Surcin 
wrote:

> Hello,
>
> Despite experimenting for a few weeks and reading the Ducc 2.0 doc book
> again and again, I am still unable to make it run on a test cluster of 2
> machines.
>
> I have 2 VMs (ducc-head and ducc-node-1) with a NFS share for /home/ducc
> and my $DUCC_HOME is /home/ducc/ducc_runtime.
>
> I generated and copied ssh keys so that my main user on ducc-head can do a
> passwordless ssh to ducc@ducc-head (that's what I understood from the doc,
> if not, can you be more specific, with Linux commands?).
>
> I compiled ducc_runtime on ducc-head and copied it on both machines in
> /local/ducc/bin with all the appropriate chown and chmod as stated in the
> doc, for both machines. I also edited site.ducc.properties accordingly.
>
> When I launch start_ducc (as ducc on ducc-head) I see both machines on the
> Web monitor but only ducc-head is up, while ducc-node-1 always stays
> "defined". Of course, when I submit the test job, it is executed on
> ducc-head, never on ducc-node-1.
>
> What am I missing?
> I have been stuck here for weeks. Please can you help me?
>
> Regards.
>


Re: remote Analysis Engines freely available

2015-10-13 Thread Eddie Epstein
There are several remote AE samples in the UIMA-AS sdk, currently "Apache
UIMA Version 2.6.0" download link at http://uima.apache.org/downloads.cgi.

$UIMA_HOME/examples/deploy/as includes
   Deploy_MeetingDetectorTAE.xml
   Deploy_MeetingFinder.xml
   Deploy_RoomNumberAnnotator.xml

After unpacking the tarball a quick start guide is in $UIMA_HOME/README
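
Roughly, the quick start comes down to something like the following (script
names as shipped in the UIMA-AS SDK bin directory; check the README for your
version and adjust the unpack location):

  export UIMA_HOME=/path/to/apache-uima-as     # where the tarball was unpacked
  $UIMA_HOME/bin/startBroker.sh &              # start the ActiveMQ broker
  $UIMA_HOME/bin/deployAsyncService.sh \
      $UIMA_HOME/examples/deploy/as/Deploy_RoomNumberAnnotator.xml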

Eddie


On Tue, Oct 13, 2015 at 12:09 PM, Olivier Austina  wrote:

> Hi,Is there a remote analysis engine which is freely available. No matter
> the Analyser type. It is for demo only. Thank you.
> Regards
> Olivier
>


Re: C-Groups status remains off in web server after installing C-Groups

2015-10-05 Thread Eddie Epstein
Satya,

DUCC v2.0 made a small change in cgconfig.conf, adding two lines to enable
CPU control:
  cpu = /cgroup;
and
  cpu {}

The contents of /cgroup on the CentOS machine are missing all of the "cpu.*"
entries.
The DUCC v2.0 agent does require cpu support to enable cgroup support.
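
A quick sanity check on the node (expected entries resemble the SLES listing
quoted below):

  ls /cgroup | grep '^cpu\.'    # should show cpu.shares, cpu.stat, cpu.cfs_* ...
  ls -ld /cgroup/ducc           # should exist and be owned by user ducc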

Eddie


On Mon, Oct 5, 2015 at 12:10 AM, Satya Nand Kanodia <
satya.kano...@orkash.com> wrote:

> Dear Eddie,
>
>
> Below are the contents of /cgroup on centos machine.
>
> cgroup.event_control   memory.max_usage_in_bytes memory.oom_control
>   notify_on_release
> cgroup.procs   memory.memsw.failcnt memory.soft_limit_in_bytes
> release_agent
> ducc   memory.memsw.limit_in_bytes memory.stat
>  tasks
> memory.failcnt memory.memsw.max_usage_in_bytes memory.swappiness
> memory.force_empty memory.memsw.usage_in_bytes memory.usage_in_bytes
> memory.limit_in_bytes  memory.move_charge_at_immigrate memory.use_hierarchy
>
> these are the permissions on /cgroup/ducc
>
> drwxr-xr-x 2 ducc root 0 Oct  5 09:29 .
>
>
> Thanks and Regards,
> Satya Nand Kanodia
>
> On 10/01/2015 07:49 PM, Eddie Epstein wrote:
>
>> FYI, below are the contents of /cgroup on a SLES 11.2 machine:
>>
>> ~$ ls /cgroup
>> cgroup.clone_children  cpu.rt_runtime_us   memory.limit_in_bytes            memory.stat            sysdefault
>> cgroup.event_control   cpu.shares          memory.max_usage_in_bytes        memory.swappiness      tasks
>> cgroup.procs           cpu.stat            memory.move_charge_at_immigrate  memory.usage_in_bytes
>> cpu.cfs_period_us      ducc                memory.numa_stat                 memory.use_hierarchy
>> cpu.cfs_quota_us       memory.failcnt      memory.oom_control               notify_on_release
>> cpu.rt_period_us       memory.force_empty  memory.soft_limit_in_bytes       release_agent
>>
>> ~$ ls -ld /cgroup/ducc/
>> drwxr-xr-x 2 ducc root 0 Sep  5 11:31 /cgroup/ducc/
>>
>>
>> On Thu, Oct 1, 2015 at 8:20 AM, Eddie Epstein <eaepst...@gmail.com>
>> wrote:
>>
>> Well, please list the contents of /cgroups to confirm that the custom
>>> cgconfig.conf is operating.
>>> Eddie
>>>
>>> On Thu, Oct 1, 2015 at 12:40 AM, Satya Nand Kanodia <
>>> satya.kano...@orkash.com> wrote:
>>>
>>> Hi Eddie,
>>>>
>>>> I had copied the same contents in  cgconfig.conf.(as it was also written
>>>> in documentation.)
>>>>
>>>> anything else ?
>>>>
>>>> Thanks and Regards,
>>>> Satya Nand Kanodia
>>>>
>>>> On 09/30/2015 05:28 PM, Eddie Epstein wrote:
>>>>
>>>> Hi Satya,
>>>>>
>>>>> There is a custom cgconfig.conf that has to be installed in /etc/
>>>>> before
>>>>> starting the cgconfig service. Please see step 2 in the section
>>>>> "CGroups
>>>>> Installation and Configuration". The custom config is repeated below.
>>>>> Regards, Eddie
>>>>>
>>>>>  # Mount cgroups
>>>>>  mount {
>>>>> memory = /cgroup;
>>>>> cpu = /cgroup;
>>>>>  }
>>>>>  # Define cgroup ducc and setup permissions
>>>>>  group ducc {
>>>>>   perm {
>>>>>   task {
>>>>>      uid = ducc;
>>>>>   }
>>>>>   admin {
>>>>>  uid = ducc;
>>>>>   }
>>>>>   }
>>>>>   memory {}
>>>>>   cpu{}
>>>>>  }
>>>>>
>>>>> On Wed, Sep 30, 2015 at 12:43 AM, Satya Nand Kanodia <
>>>>> satya.kano...@orkash.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>>> DUCC is running with the ducc user.
>>>>>> why did you say "DUCC's /etc/cgconfig.conf specifies user=ducc to
>>>>>> create
>>>>>> cgroups."? As I installed C-Groups with sudo yum, It has root as
>>>>>> owner.
>>>>>> Do
>>>>>> I need to change it's owner or permissions. It is having currently 644
>>>>>> permissions.
>>>>>>
>>>>>> Thanks and Regards,
>>>>>> Satya Nand Kanodia
>>>>>>
>>>>>> On 09/29/2015 06:46 PM, Eddie Epstein wrote:
>>>>>>
>>>>>> DUCC's /etc/cgconfig.conf specifies user=ducc to create cgroups.
>>>>>>
>>>>>>> Is DUCC running as user=ducc?
>>>>>>>
>>>>>>> Using sudo for cgreate testing suggests that the ducc userid is not
>>>>>>> being
>>>>>>> used.
>>>>>>>
>>>>>>> Eddie
>>>>>>>
>>>>>>> On Tue, Sep 29, 2015 at 3:12 AM, Satya Nand Kanodia <
>>>>>>> satya.kano...@orkash.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am using CentOS release 6.6 for DUCC installation. I did all
>>>>>>>> according
>>>>>>>> to documentation to enable C-Groups.
>>>>>>>> Following command executed without any error.( I had to execute it
>>>>>>>> using
>>>>>>>> sudo.)
>>>>>>>>
>>>>>>>> cgcreate -t ducc -a ducc -g memory:ducc/test-cgroups
>>>>>>>>
>>>>>>>> But on webserver in machines section , it is showing *off* status
>>>>>>>> under
>>>>>>>> the C-Groups.
>>>>>>>>
>>>>>>>> I don't know what went wrong.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks and Regards,
>>>>>>>> Satya Nand Kanodia
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>


Re: C-Groups status remains off in web server after installing C-Groups

2015-10-01 Thread Eddie Epstein
Well, please list the contents of /cgroups to confirm that the custom
cgconfig.conf is operating.
Eddie

On Thu, Oct 1, 2015 at 12:40 AM, Satya Nand Kanodia <
satya.kano...@orkash.com> wrote:

> Hi Eddie,
>
> I had copied the same contents in  cgconfig.conf.(as it was also written
> in documentation.)
>
> anything else ?
>
> Thanks and Regards,
> Satya Nand Kanodia
>
> On 09/30/2015 05:28 PM, Eddie Epstein wrote:
>
>> Hi Satya,
>>
>> There is a custom cgconfig.conf that has to be installed in /etc/ before
>> starting the cgconfig service. Please see step 2 in the section "CGroups
>> Installation and Configuration". The custom config is repeated below.
>> Regards, Eddie
>>
>> # Mount cgroups
>> mount {
>>memory = /cgroup;
>>cpu = /cgroup;
>> }
>> # Define cgroup ducc and setup permissions
>> group ducc {
>>  perm {
>>  task {
>> uid = ducc;
>>  }
>>  admin {
>> uid = ducc;
>>  }
>>  }
>>  memory {}
>>  cpu{}
>> }
>>
>> On Wed, Sep 30, 2015 at 12:43 AM, Satya Nand Kanodia <
>> satya.kano...@orkash.com> wrote:
>>
>> Hi,
>>>
>>> DUCC is running with the ducc user.
>>> why did you say "DUCC's /etc/cgconfig.conf specifies user=ducc to create
>>> cgroups."? As I installed C-Groups with sudo yum, It has root as owner.
>>> Do
>>> I need to change it's owner or permissions. It is having currently 644
>>> permissions.
>>>
>>> Thanks and Regards,
>>> Satya Nand Kanodia
>>>
>>> On 09/29/2015 06:46 PM, Eddie Epstein wrote:
>>>
>>> DUCC's /etc/cgconfig.conf specifies user=ducc to create cgroups.
>>>> Is DUCC running as user=ducc?
>>>>
>>>> Using sudo for cgreate testing suggests that the ducc userid is not
>>>> being
>>>> used.
>>>>
>>>> Eddie
>>>>
>>>> On Tue, Sep 29, 2015 at 3:12 AM, Satya Nand Kanodia <
>>>> satya.kano...@orkash.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>>> I am using CentOS release 6.6 for DUCC installation. I did all
>>>>> according
>>>>> to documentation to enable C-Groups.
>>>>> Following command executed without any error.( I had to execute it
>>>>> using
>>>>> sudo.)
>>>>>
>>>>> cgcreate -t ducc -a ducc -g memory:ducc/test-cgroups
>>>>>
>>>>> But on webserver in machines section , it is showing *off* status under
>>>>> the C-Groups.
>>>>>
>>>>> I don't know what went wrong.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks and Regards,
>>>>> Satya Nand Kanodia
>>>>>
>>>>>
>>>>>
>>>>>
>


Re: C-Groups status remains off in web server after installing C-Groups

2015-10-01 Thread Eddie Epstein
FYI, below are the contents of /cgroup on a SLES 11.2 machine:

~$ ls /cgroup
cgroup.clone_children  cpu.rt_runtime_us   memory.limit_in_bytes            memory.stat            sysdefault
cgroup.event_control   cpu.shares          memory.max_usage_in_bytes        memory.swappiness      tasks
cgroup.procs           cpu.stat            memory.move_charge_at_immigrate  memory.usage_in_bytes
cpu.cfs_period_us      ducc                memory.numa_stat                 memory.use_hierarchy
cpu.cfs_quota_us       memory.failcnt      memory.oom_control               notify_on_release
cpu.rt_period_us       memory.force_empty  memory.soft_limit_in_bytes       release_agent

~$ ls -ld /cgroup/ducc/
drwxr-xr-x 2 ducc root 0 Sep  5 11:31 /cgroup/ducc/


On Thu, Oct 1, 2015 at 8:20 AM, Eddie Epstein <eaepst...@gmail.com> wrote:

> Well, please list the contents of /cgroups to confirm that the custom
> cgconfig.conf is operating.
> Eddie
>
> On Thu, Oct 1, 2015 at 12:40 AM, Satya Nand Kanodia <
> satya.kano...@orkash.com> wrote:
>
>> Hi Eddie,
>>
>> I had copied the same contents in  cgconfig.conf.(as it was also written
>> in documentation.)
>>
>> anything else ?
>>
>> Thanks and Regards,
>> Satya Nand Kanodia
>>
>> On 09/30/2015 05:28 PM, Eddie Epstein wrote:
>>
>>> Hi Satya,
>>>
>>> There is a custom cgconfig.conf that has to be installed in /etc/ before
>>> starting the cgconfig service. Please see step 2 in the section "CGroups
>>> Installation and Configuration". The custom config is repeated below.
>>> Regards, Eddie
>>>
>>> # Mount cgroups
>>> mount {
>>>memory = /cgroup;
>>>cpu = /cgroup;
>>> }
>>> # Define cgroup ducc and setup permissions
>>> group ducc {
>>>  perm {
>>>  task {
>>> uid = ducc;
>>>  }
>>>  admin {
>>> uid = ducc;
>>>  }
>>>  }
>>>  memory {}
>>>  cpu{}
>>> }
>>>
>>> On Wed, Sep 30, 2015 at 12:43 AM, Satya Nand Kanodia <
>>> satya.kano...@orkash.com> wrote:
>>>
>>> Hi,
>>>>
>>>> DUCC is running with the ducc user.
>>>> why did you say "DUCC's /etc/cgconfig.conf specifies user=ducc to create
>>>> cgroups."? As I installed C-Groups with sudo yum, It has root as owner.
>>>> Do
>>>> I need to change it's owner or permissions. It is having currently 644
>>>> permissions.
>>>>
>>>> Thanks and Regards,
>>>> Satya Nand Kanodia
>>>>
>>>> On 09/29/2015 06:46 PM, Eddie Epstein wrote:
>>>>
>>>> DUCC's /etc/cgconfig.conf specifies user=ducc to create cgroups.
>>>>> Is DUCC running as user=ducc?
>>>>>
>>>>> Using sudo for cgreate testing suggests that the ducc userid is not
>>>>> being
>>>>> used.
>>>>>
>>>>> Eddie
>>>>>
>>>>> On Tue, Sep 29, 2015 at 3:12 AM, Satya Nand Kanodia <
>>>>> satya.kano...@orkash.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>>> I am using CentOS release 6.6 for DUCC installation. I did all
>>>>>> according
>>>>>> to documentation to enable C-Groups.
>>>>>> Following command executed without any error.( I had to execute it
>>>>>> using
>>>>>> sudo.)
>>>>>>
>>>>>> cgcreate -t ducc -a ducc -g memory:ducc/test-cgroups
>>>>>>
>>>>>> But on webserver in machines section , it is showing *off* status
>>>>>> under
>>>>>> the C-Groups.
>>>>>>
>>>>>> I don't know what went wrong.
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks and Regards,
>>>>>> Satya Nand Kanodia
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>
>


Re: Error when trying to drop CAS with FlowController

2015-09-07 Thread Eddie Epstein
There was a Jira opened 7 years ago to support a cas.deleteView() method,
but it has been ignored due to lack of interest.
See https://issues.apache.org/jira/browse/UIMA-830

Eddie

On Mon, Sep 7, 2015 at 11:20 AM, Zesch, Torsten <torsten.ze...@uni-due.de>
wrote:

> Only if it could completely empty the CAS including the document text, but
> as far as I know the document text cannot be changed once it is set.
>
> Am 07/09/15 17:14 schrieb "Eddie Epstein" unter <eaepst...@gmail.com>:
>
> >Can the filter in the INNER_AAE modify such CASes, perhaps
> >by deleting data, that would result in the existing consumer
> >effectively ignoring them?
> >
> >On Mon, Sep 7, 2015 at 11:08 AM, Zesch, Torsten <torsten.ze...@uni-due.de
> >
> >wrote:
> >
> >> >The consumer does not have to be modified if the flow controller
> >> >drops CASes marked to be ignored.
> >> >
> >> >Sounds like the issue in this case is that the consumer is in the
> >> >OUTER_AAE, and there is a desire not to have any components
> >> >in the OUTER_AAE be aware of the filtering operation.
> >> >Is this right?
> >>
> >> Yes exactly.
> >>
> >> -Torsten
> >>
> >>
>
>


Re: Error when trying to drop CAS with FlowController

2015-09-07 Thread Eddie Epstein
The consumer does not have to be modified if the flow controller
drops CASes marked to be ignored.

Sounds like the issue in this case is that the consumer is in the
OUTER_AAE, and there is a desire not to have any components
in the OUTER_AAE be aware of the filtering operation.
Is this right?

Eddie


On Sun, Sep 6, 2015 at 3:33 PM, Zesch, Torsten <torsten.ze...@uni-due.de>
wrote:

> Thanks for your input.
>
> To give some more information about our use case:
> Our input is a mix of documents.
> Only some of them are relevant and should be written by the consumer.
> We also thought about the solution with a special FeatureStructure, but
> this has the disadvantage that the consumer needs to be aware of that.
> It would be easier if some CASes could simply be dropped.
> I guess this could even be useful for flat workflows.
>
> -Torsten
>
>
> Am 06/09/15 17:31 schrieb "Eddie Epstein" unter <eaepst...@gmail.com>:
>
> >Keeping the filter inside the INNER may still be useful to
> >terminate any further processing in that AAE.
> >
> >outputsNewCases=true is just saying that an aggregate is
> >a CasMultiplier and *might* return child-CASes. It doesn't
> >change the CAS-in/CAS-out contract for the component.
> >
> >I think a fair amount of logic would have to be reworked
> >if that contract were changed. For sure in UIMA-AS,
> >where supporting CM services is one of the more complex
> >design issues. But maybe it would be interesting to see
> >the pros vs cons of making that change.
> >
> >Eddie
> >
> >
> >On Sun, Sep 6, 2015 at 11:20 AM, Richard Eckart de Castilho
> ><r...@apache.org>
> >wrote:
> >
> >> That would require that the OUTER_AAE is aware of the filtering.
> >> We would prefer if all customization/filtering/etc. could be done in the
> >> INNER_AAE which is the declared extension point.
> >>
> >> In the worst case, we'd probably opt to move the FILTER from to
> >> the OUTER_AAE entirely and make filtering a default option.
> >>
> >> My assumption would be that the OUTER_AAE should not have a problem
> >> if the INNER_AAE drops anything if INNER_AAE declares
> >>outputsNewCases=true.
> >> But obviously that assumption is wrong - I/we just don't get why.
> >>
> >> Cheers,
> >>
> >> -- Richard
> >>
> >> On 06.09.2015, at 17:14, Eddie Epstein <eaepst...@gmail.com> wrote:
> >>
> >> > How about the filter adds a FeatureStructure indicating that the CAS
> >> should
> >> > be dropped.
> >> > Then when the INNER_AAE returns the CAS, the flow controller in the
> >> > OUTER_AAE
> >> > sends the CAS to FinalStep?
> >> >
> >> > Eddie
> >> >
> >> > On Sun, Sep 6, 2015 at 11:08 AM, Richard Eckart de Castilho <
> >> r...@apache.org>
> >> > wrote:
> >> >
> >> >> Eddie,
> >> >>
> >> >> we (Torsten and I) have the case that a reader produces a number of
> >> CASes
> >> >> and we want to filter out some of them because they do not match a
> >>given
> >> >> criteria.
> >> >>
> >> >> The pipeline/flow structure we are using looks as follows:
> >> >>
> >> >> READER -> OUTER_AAE { AEs..., INNER_AAE { FILTER }, AEs..., CONSUMER
> >>}
> >> >>
> >> >> READER, OUTER_AAE, AEs and CONSUMER are assumed to be fixed.
> >> >>
> >> >> INNER_AAE is meant to be an extension point and the FILTER inside it
> >> >> is meant to remove all CASes that do not match our criteria such
> >> >> that those do not reach the CONSUMER.
> >> >>
> >> >> So we do explicitly not want certain CASes to continue the processing
> >> path.
> >> >>
> >> >> -- Richard
> >> >>
> >> >> On 06.09.2015, at 17:04, Eddie Epstein <eaepst...@gmail.com> wrote:
> >> >>
> >> >>> Richard,
> >> >>>
> >> >>> In general the input CAS must continue down some processing path.
> >> >>> Where is it stored and what triggers its continued processing if it
> >>is
> >> >> not
> >> >>> returned?
> >> >>>
> >> >>> Eddie
> >> >>>
> >> >>> On Sun, Sep 6, 2015 at 10:28 AM, Richard Eckart de Castilho <
> &

Re: Error when trying to drop CAS with FlowController

2015-09-07 Thread Eddie Epstein
Can the filter in the INNER_AAE modify such CASes, perhaps
by deleting data, that would result in the existing consumer
effectively ignoring them?

On Mon, Sep 7, 2015 at 11:08 AM, Zesch, Torsten 
wrote:

> >The consumer does not have to be modified if the flow controller
> >drops CASes marked to be ignored.
> >
> >Sounds like the issue in this case is that the consumer is in the
> >OUTER_AAE, and there is a desire not to have any components
> >in the OUTER_AAE be aware of the filtering operation.
> >Is this right?
>
> Yes exactly.
>
> -Torsten
>
>


Re: Error when trying to drop CAS with FlowController

2015-09-07 Thread Eddie Epstein
One way to allow a delegate to terminate subsequent flow of a primary CAS
would be for the built-in UIMA flow controller to assign FinalStep() to
CASes
containing some "drop-cas-mark".

Assuming your OUTER_AAE is not using a custom flow controller, this would
allow
the OUTER_AAE to respect a mark set by the delegate INNER_AAE without
changing
any code in the OUTER.

The next problem would be establishing a convention for the drop-cas-mark.
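
A minimal sketch of what that could look like in a custom flow (the mark type
name "org.example.DropCasMark" and the fixed delegate sequence are assumptions,
not an existing UIMA convention; the enclosing FlowController's computeFlow()
would create one such Flow per CAS, seeded with the delegate key order):

  import java.util.Iterator;
  import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
  import org.apache.uima.cas.CAS;
  import org.apache.uima.cas.Type;
  import org.apache.uima.flow.CasFlow_ImplBase;
  import org.apache.uima.flow.FinalStep;
  import org.apache.uima.flow.SimpleStep;
  import org.apache.uima.flow.Step;

  class DropMarkAwareFlow extends CasFlow_ImplBase {
    private final Iterator<String> remainingDelegates;

    DropMarkAwareFlow(Iterator<String> delegateKeys) {
      this.remainingDelegates = delegateKeys;
    }

    public Step next() throws AnalysisEngineProcessException {
      CAS cas = getCas();
      Type mark = cas.getTypeSystem().getType("org.example.DropCasMark");
      // a delegate has marked this CAS: skip all remaining delegates
      if (mark != null
          && cas.getIndexRepository().getAllIndexedFS(mark).hasNext()) {
        return new FinalStep();
      }
      return remainingDelegates.hasNext()
          ? new SimpleStep(remainingDelegates.next())
          : new FinalStep();
    }
  }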


On Mon, Sep 7, 2015 at 12:34 PM, Richard Eckart de Castilho <r...@apache.org>
wrote:

> I don't think that cas.deleteView() would be a clean solution unless UIMA
> would be default drop any CAS that has its only remaining view removed.
>
> Dropping the whole unit-of-work (the CAS) instead of stripping its content
> appear to me a cleaner solution.
>
> -- Richard
>
> On 07.09.2015, at 17:45, Eddie Epstein <eaepst...@gmail.com> wrote:
>
> > There was a Jira opened 7 years ago to support a cas.deleteView() method,
> > but it has been ignored due to lack of interest.
> > See https://issues.apache.org/jira/browse/UIMA-830
> >
> > Eddie
> >
> > On Mon, Sep 7, 2015 at 11:20 AM, Zesch, Torsten <
> torsten.ze...@uni-due.de>
> > wrote:
> >
> >> Only if it could completely empty the CAS including the document text,
> but
> >> as far as I know the document text cannot be changed once it is set.
> >>
> >> Am 07/09/15 17:14 schrieb "Eddie Epstein" unter <eaepst...@gmail.com>:
> >>
> >>> Can the filter in the INNER_AAE modify such CASes, perhaps
> >>> by deleting data, that would result in the existing consumer
> >>> effectively ignoring them?
> >>>
> >>> On Mon, Sep 7, 2015 at 11:08 AM, Zesch, Torsten <
> torsten.ze...@uni-due.de
> >>>
> >>> wrote:
> >>>
> >>>>> The consumer does not have to be modified if the flow controller
> >>>>> drops CASes marked to be ignored.
> >>>>>
> >>>>> Sounds like the issue in this case is that the consumer is in the
> >>>>> OUTER_AAE, and there is a desire not to have any components
> >>>>> in the OUTER_AAE be aware of the filtering operation.
> >>>>> Is this right?
> >>>>
> >>>> Yes exactly.
> >>>>
> >>>> -Torsten
> >>>>
> >>>>
> >>
> >>
>
>


Re: Error when trying to drop CAS with FlowController

2015-09-06 Thread Eddie Epstein
Hi Torsten,

The documentation says ...

public FinalStep(boolean aForceCasToBeDropped)

   Creates a new FinalStep, and may indicate that a CAS should be dropped.
   This can only be used for CASes that are produced internally to the
aggregate.
   It is an error to attempt to drop a CAS that was passed as input to the
aggregate.

The error must be because the drop is being applied to a CAS passed into the
aggregate from the outside, not created by a CasMultiplier inside the
aggregate.

Eddie


On Wed, Sep 2, 2015 at 4:22 PM, Zesch, Torsten 
wrote:

> Hi all,
>
> I'm trying to implement a FlowController that drops CASes matching certain
> critera. The FlowController is defined on an inner AAE which sets
> casproduced to true. The inner AAE resides in an outer AAE which contains
> additional processing before and after the inner AAE.
>
> Reader -> Outer AAE { Proc… Inner AAE { FlowController } Proc… Consumer}
> The aggregate receives various input CASes and is supposed to drop some
> but not others. When I try to drop a CAS in my FlowController now, I get
> the error
>
> Caused by: org.apache.uima.analysis_engine.AnalysisEngineProcessException:
> The FlowController attempted to drop a CAS that was passed as input to the
> Aggregate AnalysisEngine containing that FlowController.  The only CASes
> that may be dropped are those that are created within the same Aggregate
> AnalysisEngine as the FlowController.
>
> How can I drop CASes using a FlowController such that they do not proceed
> in the outer aggregate?
>
>
> thanks,
> Torsten
>
>


Re: Error when trying to drop CAS with FlowController

2015-09-06 Thread Eddie Epstein
Hi Richard,

FinalStep() in a CasMultiplier aggregate means to stop further flow
in the aggregate and return the CAS to the component that passed
the CAS into the aggregate, or if a child-CAS, passed the child's
parent-CAS into the aggregate.

FinalStep(true) is used to stop a child-CAS from being returned
to the component. But the contract for an AE is CAS-in/CAS-out,
which means a CAS coming into an AE must be returned.

Eddie

On Sun, Sep 6, 2015 at 9:59 AM, Richard Eckart de Castilho <r...@apache.org>
wrote:

> Hi Eddie,
>
> ok, but why can input CASes created outside the aggregate not be dropped?
>
> Cheers,
>
> -- Richard
>
> On 06.09.2015, at 15:58, Eddie Epstein <eaepst...@gmail.com> wrote:
>
> > Hi Torsten,
> >
> > The documentation says ...
> >
> > public FinalStep(boolean aForceCasToBeDropped)
> >
> >   Creates a new FinalStep, and may indicate that a CAS should be dropped.
> >   This can only be used for CASes that are produced internally to the
> > aggregate.
> >   It is an error to attempt to drop a CAS that was passed as input to the
> > aggregate.
> >
> > The error must be because the drop is being applied to a CAS passed into
> the
> > aggregate from the outside, not created by a CasMultiplier inside the
> > aggregate.
> >
> > Eddie
> >
> >
> > On Wed, Sep 2, 2015 at 4:22 PM, Zesch, Torsten <torsten.ze...@uni-due.de
> >
> > wrote:
> >
> >> Hi all,
> >>
> >> I'm trying to implement a FlowController that drops CASes matching
> certain
> >> critera. The FlowController is defined on an inner AAE which sets
> >> casproduced to true. The inner AAE resides in an outer AAE which
> contains
> >> additional processing before and after the inner AAE.
> >>
> >> Reader -> Outer AAE { Proc… Inner AAE { FlowController } Proc… Consumer}
> >> The aggregate receives various input CASes and is supposed to drop some
> >> but not others. When I try to drop a CAS in my FlowController now, I get
> >> the error
> >>
> >> Caused by:
> org.apache.uima.analysis_engine.AnalysisEngineProcessException:
> >> The FlowController attempted to drop a CAS that was passed as input to
> the
> >> Aggregate AnalysisEngine containing that FlowController.  The only CASes
> >> that may be dropped are those that are created within the same Aggregate
> >> AnalysisEngine as the FlowController.
> >>
> >> How can I drop CASes using a FlowController such that they do not
> proceed
> >> in the outer aggregate?
> >>
> >>
> >> thanks,
> >> Torsten
>
>


Re: CAS merger/multiplier N:M mapping

2015-09-06 Thread Eddie Epstein
Hi Petr

On Sun, Sep 6, 2015 at 10:11 AM, Petr Baudis  wrote:

>   Hi!
>
>   I'm currently struggling to perform a complex flow transformation with
> UIMA.  I have multiple (N) CASes with some fulltext search results.
> I chop these search results to sentences and would like to pick the top
> M sentences from the search results collected and build CASes from them
> to do further analysis.  So, I'd like to copy subsets (document text
> wise and annotation wise) of N input CASes to M output CASes.  I don't
> know how to do this technically.  I tried two non-workable ideas so far:
>
>   (i) Keep around references to the respective views of input CASes
> and use them as CasCopier sources when the time comes to produce
> the new CASes.  Turns out the input CASes are (unsurprisingly) recycled
> and the references I kept around at process() time aren't valid when
> next() is called much later.
>
>   (ii) Use an internal "intermediary" CAS instance in process() to which
> I append my sentences, then use it as a source of output CASes.  Turns
> out (surprisingly) that I can't append to a sofa documenttext ("Data for
> Sofa feature setLocalSofaData() has already been set." - not sure about
> the reason for this restriction).
>

The Sofa data for a view is immutable, otherwise existing annotations
could become invalid.


>
>   I think the only choice except downright unmaintainable hacks (like
> programatically generated M views) is to just give up on preserving my
> annotations and carry over just the sentence texts.  Am I missing
> something?
>

Creating a new view in the intermediate CAS for each of the N input CASes
would work. A new output CAS Sofa would be composed of data from
multiple views, with the annotation end points adjusted as they are
added to the new output CAS.

One problem there is that the intermediate CAS would continue to grow
in size, so there would need to be some point when it could be reset.
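
A rough sketch of the copy step (the view naming and the reset policy are
assumptions; the intermediate CAS is one the component allocated itself):

  import org.apache.uima.cas.CAS;
  import org.apache.uima.util.CasCopier;

  class IntermediateCasBuffer {
    private final CAS intermediate;   // allocated once, reset periodically
    private int count = 0;

    IntermediateCasBuffer(CAS intermediate) { this.intermediate = intermediate; }

    void append(CAS inputView) {
      String viewName = "input-" + (count++);        // hypothetical naming
      CasCopier copier = new CasCopier(inputView, intermediate);
      // creates the target view and copies Sofa data plus indexed FSs into it
      copier.copyCasView(inputView, viewName, true);
    }

    void reset() {
      intermediate.reset();                          // e.g. after emitting M CASes
      count = 0;
    }
  }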


>
>   (I'm somewhat tempted to cut my losses short (much too late) and
> abandon UIMA flow control altogether, using only simple pipelines and
> having custom glue code to connect these together, as it seems like
> getting the flow to work in interesting cases is a huge time sink and in
> retrospect, it could never pay off any abstract advantage of easier
> distributed processing (where you probably end up having to chop up the
> pipeline manually anyway).  I would probably never recommend new UIMA
> users to strive for a single pipeline with CAS multipliers/mergers and
> begin to consider these features an evolutionary dead end rather than
> advantageous.  Not sure if there even *are* any other real users using
> advanced flows besides me and DeepQA.  I'll be glad to hear any opinions
> on this!)
>
>
Definitely the advantage to encapsulating analytics in standard UIMA
components is easy scalability via the vertical and horizontal scale out
options offered by UIMA-AS and DUCC. Flexibility in chopping up a
pipeline into services as needed is another advantage.

The previously mentioned GALE multimodal application also converted
sequences of N input CASes to M output CASes. In that case the input
CASes represented 2 minutes worth of speech-to-text transcription of
broadcast news, and each output CAS represented a single news story.
The story-CASes then went thru a pipeline that identified the story and
updated a pre-existing summarization for each story.

Eddie

--
> Petr Baudis
> If you have good ideas, good data and fast computers,
> you can do almost anything. -- Geoffrey Hinton
>


Re: Error when trying to drop CAS with FlowController

2015-09-06 Thread Eddie Epstein
Richard,

In general the input CAS must continue down some processing path.
Where is it stored and what triggers its continued processing if it is not
returned?

Eddie

On Sun, Sep 6, 2015 at 10:28 AM, Richard Eckart de Castilho <r...@apache.org>
wrote:

> Hi Eddie,
>
> in most cases, we use process(CAS) and in such a case what you describe
> is very logical.
>
> However, when setting outputsNewCases to true, doesn't the contract change?
> My understanding is that processAndOutputNewCASes(CAS) is being
> used and in such a case. Why shouldn't it be ok that the iterator
> returned by processAndOutputNewCASes does not contain the input CAS?
>
> Cheers,
>
> -- Richard
>
> On 06.09.2015, at 16:21, Eddie Epstein <eaepst...@gmail.com> wrote:
>
> > Hi Richard,
> >
> > FinalStep() in a CasMultiplier aggregate means to stop further flow
> > in the aggregate and return the CAS to the component that passed
> > the CAS into the aggregate, or if a child-CAS, passed the child's
> > parent-CAS into the aggregate.
> >
> > FinalStep(true) is used to stop a child-CAS from being returned
> > to the component. But the contract for an AE is CAS-in/CAS-out,
> > which means a CAS coming into an AE must be returned.
> >
> > Eddie
> >
> > On Sun, Sep 6, 2015 at 9:59 AM, Richard Eckart de Castilho <
> r...@apache.org>
> > wrote:
> >
> >> Hi Eddie,
> >>
> >> ok, but why can input CASes created outside the aggregate not be
> dropped?
> >>
> >> Cheers,
> >>
> >> -- Richard
>
>


Re: Error when trying to drop CAS with FlowController

2015-09-06 Thread Eddie Epstein
How about the filter adds a FeatureStructure indicating that the CAS should
be dropped.
Then when the INNER_AAE returns the CAS, the flow controller in the
OUTER_AAE
sends the CAS to FinalStep?

Eddie

On Sun, Sep 6, 2015 at 11:08 AM, Richard Eckart de Castilho <r...@apache.org>
wrote:

> Eddie,
>
> we (Torsten and I) have the case that a reader produces a number of CASes
> and we want to filter out some of them because they do not match a given
> criteria.
>
> The pipeline/flow structure we are using looks as follows:
>
> READER -> OUTER_AAE { AEs..., INNER_AAE { FILTER }, AEs..., CONSUMER }
>
> READER, OUTER_AAE, AEs and CONSUMER are assumed to be fixed.
>
> INNER_AAE is meant to be an extension point and the FILTER inside it
> is meant to remove all CASes that do not match our criteria such
> that those do not reach the CONSUMER.
>
> So we do explicitly not want certain CASes to continue the processing path.
>
> -- Richard
>
> On 06.09.2015, at 17:04, Eddie Epstein <eaepst...@gmail.com> wrote:
>
> > Richard,
> >
> > In general the input CAS must continue down some processing path.
> > Where is it stored and what triggers its continued processing if it is
> not
> > returned?
> >
> > Eddie
> >
> > On Sun, Sep 6, 2015 at 10:28 AM, Richard Eckart de Castilho <
> r...@apache.org>
> > wrote:
> >
> >> Hi Eddie,
> >>
> >> in most cases, we use process(CAS) and in such a case what you describe
> >> is very logical.
> >>
> >> However, when setting outputsNewCases to true, doesn't the contract
> change?
> >> My understanding is that processAndOutputNewCASes(CAS) is being
> >> used and in such a case. Why shouldn't it be ok that the iterator
> >> returned by processAndOutputNewCASes does not contain the input CAS?
> >>
> >> Cheers,
> >>
> >> -- Richard
> >>
> >> On 06.09.2015, at 16:21, Eddie Epstein <eaepst...@gmail.com> wrote:
> >>
> >>> Hi Richard,
> >>>
> >>> FinalStep() in a CasMultiplier aggregate means to stop further flow
> >>> in the aggregate and return the CAS to the component that passed
> >>> the CAS into the aggregate, or if a child-CAS, passed the child's
> >>> parent-CAS into the aggregate.
> >>>
> >>> FinalStep(true) is used to stop a child-CAS from being returned
> >>> to the component. But the contract for an AE is CAS-in/CAS-out,
> >>> which means a CAS coming into an AE must be returned.
> >>>
> >>> Eddie
> >>>
> >>> On Sun, Sep 6, 2015 at 9:59 AM, Richard Eckart de Castilho <
> >> r...@apache.org>
> >>> wrote:
> >>>
> >>>> Hi Eddie,
> >>>>
> >>>> ok, but why can input CASes created outside the aggregate not be
> >> dropped?
> >>>>
> >>>> Cheers,
> >>>>
> >>>> -- Richard
> >>
> >>
>
>


Re: Error when trying to drop CAS with FlowController

2015-09-06 Thread Eddie Epstein
Keeping the filter inside the INNER may still be useful to
terminate any further processing in that AAE.

outputsNewCases=true is just saying that an aggregate is
a CasMultiplier and *might* return child-CASes. It doesn't
change the CAS-in/CAS-out contract for the component.

I think a fair amount of logic would have to be reworked
if that contract were changed. For sure in UIMA-AS,
where supporting CM services is one of the more complex
design issues. But maybe it would be interesting to see
the pros vs cons of making that change.

Eddie


On Sun, Sep 6, 2015 at 11:20 AM, Richard Eckart de Castilho <r...@apache.org>
wrote:

> That would require that the OUTER_AAE is aware of the filtering.
> We would prefer if all customization/filtering/etc. could be done in the
> INNER_AAE which is the declared extension point.
>
> In the worst case, we'd probably opt to move the FILTER from to
> the OUTER_AAE entirely and make filtering a default option.
>
> My assumption would be that the OUTER_AAE should not have a problem
> if the INNER_AAE drops anything if INNER_AAE declares outputsNewCases=true.
> But obviously that assumption is wrong - I/we just don't get why.
>
> Cheers,
>
> -- Richard
>
> On 06.09.2015, at 17:14, Eddie Epstein <eaepst...@gmail.com> wrote:
>
> > How about the filter adds a FeatureStructure indicating that the CAS
> should
> > be dropped.
> > Then when the INNER_AAE returns the CAS, the flow controller in the
> > OUTER_AAE
> > sends the CAS to FinalStep?
> >
> > Eddie
> >
> > On Sun, Sep 6, 2015 at 11:08 AM, Richard Eckart de Castilho <
> r...@apache.org>
> > wrote:
> >
> >> Eddie,
> >>
> >> we (Torsten and I) have the case that a reader produces a number of
> CASes
> >> and we want to filter out some of them because they do not match a given
> >> criteria.
> >>
> >> The pipeline/flow structure we are using looks as follows:
> >>
> >> READER -> OUTER_AAE { AEs..., INNER_AAE { FILTER }, AEs..., CONSUMER }
> >>
> >> READER, OUTER_AAE, AEs and CONSUMER are assumed to be fixed.
> >>
> >> INNER_AAE is meant to be an extension point and the FILTER inside it
> >> is meant to remove all CASes that do not match our criteria such
> >> that those do not reach the CONSUMER.
> >>
> >> So we do explicitly not want certain CASes to continue the processing
> path.
> >>
> >> -- Richard
> >>
> >> On 06.09.2015, at 17:04, Eddie Epstein <eaepst...@gmail.com> wrote:
> >>
> >>> Richard,
> >>>
> >>> In general the input CAS must continue down some processing path.
> >>> Where is it stored and what triggers its continued processing if it is
> >> not
> >>> returned?
> >>>
> >>> Eddie
> >>>
> >>> On Sun, Sep 6, 2015 at 10:28 AM, Richard Eckart de Castilho <
> >> r...@apache.org>
> >>> wrote:
> >>>
> >>>> Hi Eddie,
> >>>>
> >>>> in most cases, we use process(CAS) and in such a case what you
> describe
> >>>> is very logical.
> >>>>
> >>>> However, when setting outputsNewCases to true, doesn't the contract
> >> change?
> >>>> My understanding is that processAndOutputNewCASes(CAS) is being
> >>>> used and in such a case. Why shouldn't it be ok that the iterator
> >>>> returned by processAndOutputNewCASes does not contain the input CAS?
> >>>>
> >>>> Cheers,
> >>>>
> >>>> -- Richard
> >>>>
> >>>> On 06.09.2015, at 16:21, Eddie Epstein <eaepst...@gmail.com> wrote:
> >>>>
> >>>>> Hi Richard,
> >>>>>
> >>>>> FinalStep() in a CasMultiplier aggregate means to stop further flow
> >>>>> in the aggregate and return the CAS to the component that passed
> >>>>> the CAS into the aggregate, or if a child-CAS, passed the child's
> >>>>> parent-CAS into the aggregate.
> >>>>>
> >>>>> FinalStep(true) is used to stop a child-CAS from being returned
> >>>>> to the component. But the contract for an AE is CAS-in/CAS-out,
> >>>>> which means a CAS coming into an AE must be returned.
> >>>>>
> >>>>> Eddie
> >>>>>
> >>>>> On Sun, Sep 6, 2015 at 9:59 AM, Richard Eckart de Castilho <
> >>>> r...@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Eddie,
> >>>>>>
> >>>>>> ok, but why can input CASes created outside the aggregate not be
> >>>> dropped?
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> -- Richard
> >>>>
> >>>>
> >>
> >>
>
>


Re: DUCC multi-node installation. Beginner's questions.

2015-07-22 Thread Eddie Epstein
Hi Sergii,

The ducc_runtime tree needs to be installed on a shared filesystem
that all DUCC nodes have mounted in the same location. Just install
the ducc runtime once from the DUCC head node. All other DUCC
nodes simply need to have the mounted filesystem and common user
accounts with identical user and group IDs.

An NFS shared filesystem as you describe would be fine, and the
same filesystem could be used for providing shared user space,
typically the users home directories.

Eddie


Re: Multi-threaded UIMA ParallelStep

2015-05-20 Thread Eddie Epstein
Parallel-step currently only works with remote delegates. The other
approach, using CasMultipliers, allows an arbitrary amount of parallel
processing in-process. A CM would create a separate CAS for each delegate
intended to run in parallel, and use a feature structure to hold a unique
identifier in each child CAS which a custom flow controller would use to
direct these CASes to the desired delegates. Results for the parallel flows
could be merged in a CasConsumer back into the parent CAS or to some other
output.

Some other key concepts here are the CasCopier, which can be used to
efficiently copy large amounts of CAS content from one CAS to another, and
process-parent-last which can be specified for a CasMultiplier so that
further processing of a parent CAS will not continue until all of its
children have completed processing.
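
A hedged sketch of such a CasMultiplier (the branch names and the routing type
"org.example.BranchId" with a String feature "id" are assumptions; the type
must be declared in the type system, and the descriptor must declare the
component as a CAS multiplier):

  import org.apache.uima.analysis_component.JCasMultiplier_ImplBase;
  import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
  import org.apache.uima.cas.AbstractCas;
  import org.apache.uima.cas.CAS;
  import org.apache.uima.cas.FeatureStructure;
  import org.apache.uima.cas.Type;
  import org.apache.uima.jcas.JCas;

  public class BranchingCasMultiplier extends JCasMultiplier_ImplBase {
    private static final String[] BRANCHES = { "branchA", "branchB" };
    private JCas parent;
    private int branchIndex;

    public void process(JCas jcas) { parent = jcas; branchIndex = 0; }

    public boolean hasNext() {
      return parent != null && branchIndex < BRANCHES.length;
    }

    public AbstractCas next() throws AnalysisEngineProcessException {
      JCas child = getEmptyJCas();
      child.setDocumentText(parent.getDocumentText());
      CAS c = child.getCas();
      Type t = c.getTypeSystem().getType("org.example.BranchId");
      FeatureStructure fs = c.createFS(t);
      fs.setStringValue(t.getFeatureByBaseName("id"), BRANCHES[branchIndex++]);
      c.addFsToIndexes(fs);   // a custom flow controller routes on this value
      return child;
    }
  }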

Eddie


On Tue, May 19, 2015 at 9:27 PM, Petr Baudis pa...@ucw.cz wrote:

   Hi!

   I'm looking into ways to run a part of my pipeline multi-threaded:

               .- Multip0 - A1 - Multip1 - A2 -.
  reader - A0 -                                 - CASmerger
               `- Multip2 - A3 -------- A2 ----'
                ^^
                ParallelStep is generated for each branch
                in a custom flow controller

 Basically, I need a way to tell UIMA to run each ParallelStep (which
 normally just denotes the CAS flow) truly in parallel.  I have two
 constraints:

   (i) I'm using UIMAfit heavily, and multiple CAS multipliers and
 mergers (even within the parallel branches).  So I can't use CPE.

   (ii) I need multi-threading, not separate processes.  (I have just
 a meager 24G RAM (sigh) and one Java process with all the linguistic
 models and stuff loaded takes 3GB RAM.  So I really need to load these
 resources to memory only once.)


   I looked into UIMA-AS, including Richard's helpful DKpro-lab code
 sample, but I can't figure out how to make it reasonably work with
 a *complex* UIMAfit pipeline that spans many branches and many
 analysis engines - it seems to me that I would need some centralized
 places where to specify it, and basically completely rewrite my pipeline
 building code (to the worse, in my impression).

   ...and I'm not even sure, from reading UIMA-AS code, if I could make
 it run in multiple threads within a single process!  From comments in


 org/apache/uima/aae/controller/AggregateAnalysisEngineController_impl.java:parallelStep()

 I'm getting an impression that non-remote AEs will be executed serially
 after all, not in parallel.  Is that correct?


   So going back to the original UIMA code, it seems to me that the thing
 to do would be replacing ASB_impl with my own copy (inheritance would
 not cut it the way it's coded), AggregateAnalysisEngine_impl with my own
 specialization or copy (as ASB_impl usage is hardcoded there) and
 rewrite the while() loop in ParallelStep case of ASB's
 processUntilNextOutputCas() to run in parallel.  And hope I didn't miss
 any catch...


   Is there an option I'm missing?  Any hints would be really
 appreciated!

   Thanks,

 Petr Baudis



Re: Multi-threaded UIMA ParallelStep

2015-05-20 Thread Eddie Epstein
Right about the flow controller. That's where UIMA-AS comes in. Assuming
that the CM has a casPool with enough CASes, and the aggregate is deployed
asynchronously, then each delegate will be running in its own thread and
can be processing CASes in parallel.

The ASB is a single-threaded controller used for deployment of synchronous
aggregates.

Is the intention here to use parallel processing to reduce latency for an
interactive application or to increase throughput for batch processing? For
throughput, why not just deploy the entire pipeline single-threaded and
then run multiple pipeline instances in separate threads? UIMA-AS would do
this by specifying N instances of a synchronous top-level aggregate.
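
For illustration, the relevant fragment of a UIMA-AS deployment descriptor for
that last option might look roughly like this (queue name, descriptor path and
instance count are placeholders); each of the 8 instances is a complete
synchronous copy of the pipeline running in its own thread:

  <service>
    <inputQueue endpoint="MyPipelineQueue" brokerURL="tcp://localhost:61616"/>
    <topDescriptor>
      <import location="desc/MyAggregate.xml"/>
    </topDescriptor>
    <analysisEngine async="false">
      <scaleout numberOfInstances="8"/>
    </analysisEngine>
  </service>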

Eddie


On Wed, May 20, 2015 at 8:49 AM, Petr Baudis pa...@ucw.cz wrote:

   Hi!

 On Wed, May 20, 2015 at 07:56:33AM -0400, Eddie Epstein wrote:
  Parallel-step currently only works with remote delegates. The other
  approach, using CasMultipliers, allows an arbitrarily amount of parallel
  processing in-process. A CM would create a separate CAS for each delegate
  intended to run in parallel, and use a feature structure to hold a unique
  identifier in each child CAS which a custom flow controller would use to
  direct these CASes to the desired delegates. Results for the parallel
 flows
  could be merged in a CasConsumer back into the parent CAS or to some
 other
  output.

   Thanks for that hint.  However, I'm not sure how a flow controller
 could direct CASes to delegates?  As far as I understand it, the flow
 controller decides which AE processes the CAS next, but cannot control
 the actual parallel execution of the flow, which would need to be taken
 care by the ASB (Analysis Structure Broker), and that would be the thing
 to hack in this case.  Am I missing something?

   Thanks,

 Petr Baudis



Re: DUCC- process_dd

2015-05-01 Thread Eddie Epstein
Reshu,

UIMA-AS configurations are normally used in DUCC as Services for
interactive applications or to support Jobs. They can be used in Jobs, but
typically are not.

There is also a difference in the inputs between Job processes and
Services. Services will normally receive a CAS with the artifact to be
analyzed. A Job process will receive a CAS with a reference to the artifact
or even a collection of artifacts; this is important for Job scale out to
avoid making the Job's Collection Reader a bottleneck.

I suggest starting with one of the sample applications and adapting it to
your needs. We can help if you give some details about the format of the
input and output data.

Eddie

On Fri, May 1, 2015 at 12:31 AM, reshu.agarwal reshu.agar...@orkash.com
wrote:

 Eddie,

 I was using this same scenario and doing hit and try to compare this with
 UIMA AS to get a more scaled pipeline, as I think UIMA AS can also do
 this. But I am unable to match the processing time of DUCC's default
 configuration like you mentioned with UIMA AS.

 Can you help me in doing this? I just want to do scaling by using best
 configuration of UIMA AS and DUCC which can be done using process_dd. But
 How??

 Thanks in advanced.

 Reshu.


 On 05/01/2015 03:28 AM, Eddie Epstein wrote:

 The simplest way of vertically scaling a Job process is to specify the
 analysis pipeline using core UIMA descriptors and then using
 --process_thread_count to specify how many copies of the pipeline to
 deploy, each in a different thread. No use of UIMA-AS at all. Please check
 out the Raw Text Processing sample application that comes with DUCC.

 On Wed, Apr 29, 2015 at 12:30 AM, reshu.agarwal reshu.agar...@orkash.com
 
 wrote:

  Ohh!!! I misunderstand this. I thought this would scale my Aggregate and
 AEs both.

 I want to scale aggregate as well as individual AEs. Is there any way of
 doing this in UIMA AS/DUCC?



 On 04/28/2015 07:14 PM, Jaroslaw Cwiklik wrote:

  In async aggregate you scale individual AEs not the aggregate as a
 whole.
 The below configuration should do that. Are there any warnings from
 dd2spring at startup with your configuration?

 <analysisEngine async="true">
   <delegates>
     <analysisEngine key="ChunkerDescriptor">
       <scaleout numberOfInstances="5"/>
     </analysisEngine>
     <analysisEngine key="NEDescriptor">
       <scaleout numberOfInstances="5"/>
     </analysisEngine>
     <analysisEngine key="StemmerDescriptor">
       <scaleout numberOfInstances="5"/>
     </analysisEngine>
     <analysisEngine key="ConsumerDescriptor">
       <scaleout numberOfInstances="5"/>
     </analysisEngine>
   </delegates>
 </analysisEngine>

 Jerry

 On Tue, Apr 28, 2015 at 5:20 AM, reshu.agarwal 
 reshu.agar...@orkash.com
 wrote:

   Hi,

 I was trying to scale my processing pipeline to be run in DUCC
 environment
 with uima as process_dd. If I was trying to scale using the below given
 configuration, the threads started were not as expected:


  <analysisEngineDeploymentDescription
      xmlns="http://uima.apache.org/resourceSpecifier">

    <name>Uima v3 Deployment Descripter</name>
    <description>Deploys Uima v3 Aggregate AE using the Advanced Fixed Flow
      Controller</description>

    <deployment protocol="jms" provider="activemq">
      <casPool numberOfCASes="5"/>
      <service>
        <inputQueue endpoint="UIMA_Queue_test"
            brokerURL="tcp://localhost:61617?jms.useCompression=true" prefetch="0"/>
        <topDescriptor>
          <import
            location="../Uima_v3_test/desc/orkash/ae/aggregate/FlowController_Uima.xml"/>
        </topDescriptor>
        <analysisEngine async="true" key="FlowControllerAgg"
            internalReplyQueueScaleout="10" inputQueueScaleout="10">
          <scaleout numberOfInstances="5"/>
          <delegates>
            <analysisEngine key="ChunkerDescriptor">
              <scaleout numberOfInstances="5"/>
            </analysisEngine>
            <analysisEngine key="NEDescriptor">
              <scaleout numberOfInstances="5"/>
            </analysisEngine>
            <analysisEngine
              key

Re: UIMA-AS and ActiveMQ ports

2015-04-27 Thread Eddie Epstein
UIMA-AS has example deployment descriptors using placeholders for the
broker: ${defaultBrokerURL}
If these placeholders are used and the user doesn't specify a value for the
Java property defaultBrokerURL then some code in UIMA-AS will use a
default value of tcp://localhost:61616. That is the only dependency I am
aware of.
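
For example, when the placeholder is used, the broker can be redirected by
setting the Java property at service launch time (host and port here are just
placeholders):

   $ export UIMA_JVM_OPTS="-DdefaultBrokerURL=tcp://mybroker:61626"
   $ deployAsyncService.sh myDeploymentDescriptor.xml
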
Eddie

On Mon, Apr 27, 2015 at 4:47 PM, D. Heinze dhei...@gnoetics.com wrote:

 Does UIMA-AS have internal dependencies on ActiveMQ port 61616?  I can
 change my applications to use other ports, but it seems that 61616 still
 needs to be available for something in UIMA-AS.

 Thanks / Dan





Re: Error handling in flow control

2015-04-24 Thread Eddie Epstein
Can you give more details on the overall pipeline deployment? The initial
description mentions a CPE and it mentions services. The CPE was created
before flow controllers or CasMultipliers existed and has no support for
them. Services could be Vinci services for the CPE or UIMA-AS services or
???

On Fri, Apr 24, 2015 at 5:37 AM, Mario Gazzo mario.ga...@gmail.com wrote:

 I am trying to get error handling to work with a custom flow control. I
 need to send status information back to a service after the flow completed
 either with or without errors but I can only do this once for any workflow
 item because it changes the state of the job, at least without error
 replies and wasteful requests. The problem is that I need to do several
 retries before finally failing and reporting the status to a service. First
 I tried to let the CPE do the retry for me by setting the max error count
 but then a new flow object is created every time and I lose track of the
 number of retries before this. This means that I don’t know when to report
 the status to the service because it should only happen after the final
 retry.

 I then tried to let the flow instance manage the retries by moving back to
 the previous step again but then I get the error
 “org.apache.uima.cas.CASRuntimeException: Data for Sofa feature
 setLocalSofaData() has already been set”, which is because the document
 text is set in this particular test case. I then also tried to reset the
 CAS completely before retrying the pipeline from scratch and this of course
 throws the error “CASAdminException: Can't flush CAS, flushing is
 disabled.”. It would be less wasteful if only the failed step is retried
 instead of the whole pipeline but this requires clean up, which in some
 cases might be impossible. It appears that managing errors can be rather
 complex because the CAS can be in an unknown state and an analysis engine
 operation is not idempotent. I probably need to start the whole pipeline
 from the start if I want more than a single attempt, which gets me back to
 the problem of tracking the number of attempts before reporting back to the
 service.

 Does anyone have any good suggestion on how to do this in UIMA e.g.
 passing state information from a failed flow to the next flow attempt?




Re: UIMA CPE appears not to utilise more than a single thread

2015-04-13 Thread Eddie Epstein
The CPE runs pipeline threads in parallel, not necessarily CAS processors.
In a CPE descriptor, generally all non-CasConsumer components make up the
pipeline.

Change the following line to indicate how many pipeline threads to run, and
make sure the casPoolSize is the number of threads + 2.

<casProcessors casPoolSize="2" processingUnitThreadCount="1">
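
For example, to run 8 pipeline threads (the numbers are only illustrative):

  <casProcessors casPoolSize="10" processingUnitThreadCount="8">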

Eddie

On Mon, Apr 13, 2015 at 7:44 AM, Mario Gazzo mario.ga...@gmail.com wrote:

 It appears that I can only utilise a single CAS processor even if I
 specify many more. I am not sure what I am doing wrong but I think I must
 be missing something important in my configuration.

 We only need multithreading and not the distributed features of UIMA CPE
 or similar. I copied and modified the UIMA FIT CpePipeline and CpeBuilder
 to do this and I only altered thread counts and error handling since I want
 the CAS just to be dropped on exceptions. I have verified that the accurate
 number of CAS processors are created using the debugger and I can in
 JConsole see that an equivalent amount of active threads are created but
 only one thread seems to be fed from my simple custom collection reader,
 which in the simple test setup only reads input entries from a file. I can
 see this because I log the thread id inside the AEs, which is always the
 same. I have also verified that the CAS pool size equals the number of
 processors + 2.

 Is there some additional collection reader configuration required to feed
 all the other CAS processors?








Re: Ducc Problems

2015-03-03 Thread Eddie Epstein
.
 activemq.JmsInputChannel
 stopChannel
 INFO: Controller: ducc.jd.queue.13202 Stopped Listener on Endpoint:
 queue://ducc.jd.queue.13202 Selector:  Selector:Command=2001.
 UIMA-AS Service is Stopping, All CASes Have Been Processed
 Feb 19, 2015 5:39:56 PM org.apache.uima.aae.controller.
 PrimitiveAnalysisEngineController_impl stop
 INFO: Stopping Controller: ducc.jd.queue.13202
 Feb 19, 2015 5:39:56 PM org.apache.uima.adapter.jms.
 activemq.JmsInputChannel
 stopChannel
 INFO: Stopping Service JMS Transport. Service: ducc.jd.queue.13202
 ShutdownNow true
 Feb 19, 2015 5:39:56 PM org.apache.uima.adapter.jms.
 activemq.JmsInputChannel
 stopChannel
 INFO: Controller: ducc.jd.queue.13202 Stopped Listener on Endpoint:
 queue://ducc.jd.queue.13202 Selector:  Selector:Command=2000 OR
 Command=2002.
 Feb 19, 2015 5:39:56 PM org.apache.uima.adapter.jms.
 activemq.JmsInputChannel
 stopChannel
 INFO: Stopping Service JMS Transport. Service: ducc.jd.queue.13202
 ShutdownNow true
 Feb 19, 2015 5:39:56 PM org.apache.uima.adapter.jms.
 activemq.JmsInputChannel
 stopChannel
 INFO: Controller: ducc.jd.queue.13202 Stopped Listener on Endpoint:
 queue://ducc.jd.queue.13202 Selector:  Selector:Command=2001.
 Feb 19, 2015 5:39:56 PM org.apache.uima.adapter.jms.
 activemq.JmsOutputChannel
 stop
 INFO: Controller: ducc.jd.queue.13202 Output Channel Shutdown Completed

 Thanks Reshu.



 On 02/20/2015 12:40 AM, Jaroslaw Cwiklik wrote:

  One possible explanation for destroy() not getting called is that a
 process
 (JP) may be still working on a CAS when Ducc deallocates the process.
 Ducc
 first asks the process to quiesce and stop and allows it 1 minute to
 terminate on its own. If this does not happen, Ducc kills the process
 via
 kill -9. In such case the process will be clobbered and destroy()
 methods
 in UIMA-AS are not called.
 There should be some evidence in JP logs at the very end. Look for
 something like this:

   Process Received a Message. Is Process target for message:true.

 Target PID:27520

 configFactory.stop() - stopped

 route:mina:tcp://localhost:49338?transferExchange=truesync=false

 01:56:22.735 - 94:
 org.apache.uima.aae.controller.PrimitiveAnalysisEngineControl
 ler_impl.quiesceAndStop:
 INFO: Stopping Controller: ducc.jd.queue.226091
 Quiescing UIMA-AS Service. Remaining Number of CASes to Process:0

 Look at the timestamp of  Process Received a Message. Is
 Process
 target for message:true.
 and compare it to a timestamp of the last log message. Does it look like
 there is a long delay?


 Jerry

 On Wed, Feb 18, 2015 at 2:03 AM, reshu.agarwal 
 reshu.agar...@orkash.com
 wrote:

   Dear Eddie,

 This problem has been resolved by using destroy method in ducc version
 1.0.0 but when I upgrade my ducc version from 1.0.0 to 1.1.0 DUCC
 didn't
 call the destroy method.

 It also do not call the stop method of CollectionReader as well as
 finalize method of any java class as well as destroy/
 collectionProcessComplete
 method of cas consumer.

 I want to close my connection to Database after completion of job as
 well
 as want to use batch processing at cas consumer level like
 PersonTitleDBWriterCasConsumer.

 Thanks in advanced.

 Reshu.




 On 03/31/2014 04:14 PM, reshu.agarwal wrote:

   On 03/28/2014 05:28 PM, Eddie Epstein wrote:

   Another alternative would be to do the final flush in the Cas

 consumer's
 destroy method.

 Another issue to be aware of, in order to balance resources between
 jobs,
 DUCC uses preemption of job processes scheduled in a fair-share
 class.
 This may not be acceptable for jobs which are doing incremental
 commits.
 The solution is to schedule the job in a non-preemptable class.


 On Fri, Mar 28, 2014 at 1:22 AM, reshu.agarwal 
 reshu.agar...@orkash.com

  wrote:

 On 03/28/2014 01:28 AM, Eddie Epstein wrote:

 Hi Reshu,

  The Job model in DUCC is for the Collection Reader to send work
 item
 CASes, where a work item represents a collection of work to be
 done
 by a
 Job Process. For example, a work item could be a file or a subset
 of
 a
 file
 that contains many documents, where each document would be
 individually
 put
 into a CAS by the Cas Multiplier in the Job Process.

 DUCC is designed so that after processing the mini-collection
 represented
 by the work item,  the Cas Consumer should flush any data. This is
 done by
 routing the work item CAS to the Cas Consumer, after all work
 item
 documents are completed, at which point the CC does the flush.

 The sample code described in
 http://uima.apache.org/d/uima-ducc-1.0.0/duccbook.html#x1-1380009
 uses
 the
 work item CAS to flush data in exactly this way.

 Note that the PersonTitleDBWriterCasConsumer is doing a flush (a
 commit)
 in
 the process method after every 50 documents.

 Regards
 Eddie



 On Thu, Mar 27, 2014 at 1:35 AM, reshu.agarwal 
 reshu.agar...@orkash.com
 wrote:

 On 03/26/2014 11:34 PM, Eddie Epstein wrote:

  Hi Reshu,

   The collectionProcessingComplete() method in UIMA-AS has

Re: DUCC- Agent1 is on Physical and Agent2 is on virtual=Slow the job process timing

2014-12-19 Thread Eddie Epstein
Hi Reshu,

On Fri, Dec 19, 2014 at 12:26 AM, reshu.agarwal reshu.agar...@orkash.com
wrote:


 Hi,

 Is there any problem if one Agent node is on Physical(Master) and one
 agent node is on virtual?

 I am running a job which is having avg processing timing of 20 min when I
 have configured a single machine DUCC (physical machine)as well as when
 both nodes were on physical machine only.


So the job is running at the same speed on one physical machine as on two
physical machines? Do the two machines have similar CPU performance and number
of cores?



 When I have shifted my one agent node to virtual machine avg processing
 timing of Job was increased to 1 Hour. Here I noticed that my job driver
 was also running only on virtual machine's agent node.


Hard to diagnose with so little information. Look at the job details
page. The work items tab will show the processing time for each work item
and the machine it ran on. See if the timing is clearly different for work
on virtual machine. Look at the performance tab and compare the breakdown
between fast and slow jobs, where are the differences?

There are several factors which influence performance. Is the job process
CPU bound or does it have significant I/O wait time? Is the total number
of processing threads on a machine much larger than the number of real CPU
cores on the machine? Average CPU usage for each JP is shown on the
processes tab of each job; 100% = one CPU, 200% = 2 CPU. How do these
numbers look vs the number of processing threads each JP is running?



 Can we run job driver to specific agent node so that I will be able to
 test any other Case Scenario? Because I also tried to run my job's process
 on agent node of physical machine but it didn't reflect the processing time
 much.


Normally a job driver is not a bottleneck. It can be a bottleneck if the
driver is sending raw data instead of references to raw data to the JPs. Or
if the JD is running on a machine that is in bad shape, paging, etc. What
is the CPU reported for the JD?



 Thanks in advanced.

 Reshu.



Re: Ruta parallel execution

2014-12-19 Thread Eddie Epstein
Hi Silvestre,

An aggregate deployed with UIMA-AS can be used to run delegate annotators
in parallel, with a few restrictions.
 - the aggregate must be deployed as async=true
 - the parallel delegates must each be running in remote processes
 - the delegates must not modify preexisting FS

As Jens suggests, the resultant latency improvement depends on the remoting
overhead vs processing time. Latency will also be subject to the latency of
the slowest parallel delegate.

Remoting overhead can be reduced using the binary serialization option, but
then all services must have identical typesystems.

Eddie


On Fri, Dec 19, 2014 at 9:10 AM, Silvestre Losada 
silvestre.los...@gmail.com wrote:

 Hi Jens,

 First of all thanks for your detailed answer. UIMA ruta has an option in
 order to execute an analysis engine from ruta script here
 http://goo.gl/ekbhv8 is described. So inside the script you can execute
 the analysis engine and then apply some rules to the annotations created by
 the analysis engine. What I want is to have the option to execute the
 analysis engines in parallel to save time. Would it be possible?

 Kind regards

 On 19 December 2014 at 12:35, Jens Grivolla j+...@grivolla.net wrote:
 
  Hi Silvestre,
 
  there doesn't seem to be anything RUTA-specific in your question. In
  principle, UIMA-AS allows parallel scaleout and merges the results
 (though
  I personally have never used it this way), but there are of course a few
  things to take into account.
 
  First, you will of course need to properly define the dependencies
 between
  your different analysis engines to ensure you always have all the
  necessary information available, meaning that you can only run things in
  parallel that are independent of one another. And then you will have to
 see
  if the overhead from distributing your CAS to several engines running in
  parallel and then merging the results is not greater than just having it
 in
  one colocated pipeline that can pass the information more efficiently. I
  guess you'll have to benchmark your specific application, but maybe
  somebody with more experience can give you some general directions...
 
  Best,
  Jens
 
  On Thu, Dec 18, 2014 at 12:26 PM, Silvestre Losada 
  silvestre.los...@gmail.com wrote:
  
   Well let me explain.
  
   Ruta scripts are really good to work over output of analysis engines,
  each
   analysis engine will make some atomic work and using ruta rules you can
   easily work over generated annotations combine them, remove them...
  What I
   need is to execute several analysis engines in parallel to improve the
   response time, so now the analysis engines are executed sequentially
 and
  I
   want to execute them in parallel, then take the output of all of them
 and
   apply some ruta rules to the output.
  
   would it be possible.
  
   On 17 December 2014 at 18:13, Peter Klügl pklu...@uni-wuerzburg.de
   wrote:
   
Hi,
   
I haven't used UIMA-AS (with ruta) in a real application yet, but I
tested it once for an rc. Did you face any problems?
   
Best
   
Peter
   
Am 17.12.2014 14:34, schrieb Silvestre Losada:
 Hi All,

 Is there any way to execute ruta scripts in parallel, using uima-AS
  aproach? in case yes could you provide me an example.

 Kind regards.

   
   
  
 



Re: Serializing Specific View to XMI

2014-12-04 Thread Eddie Epstein
I think that is not supported directly. One could use the CasCopier to copy
the view(s) of interest to a new, empty CAS and serialize to xmi file from
that.
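
A rough sketch of that approach (the view name "SystemView" and the output
file name are assumptions for this example):

  import java.io.FileOutputStream;
  import java.io.OutputStream;
  import org.apache.uima.cas.CAS;
  import org.apache.uima.cas.impl.XmiCasSerializer;
  import org.apache.uima.util.CasCopier;
  import org.apache.uima.util.CasCreationUtils;
  import org.apache.uima.util.TypeSystemUtil;

  public class SingleViewXmiWriter {
    // writes only the "SystemView" contents of 'source' to an XMI file
    public static void writeSystemView(CAS source, String outFile) throws Exception {
      // empty CAS built from the same type system as the source CAS
      CAS target = CasCreationUtils.createCas(
          TypeSystemUtil.typeSystem2TypeSystemDescription(source.getTypeSystem()),
          null, null);
      CasCopier copier = new CasCopier(source, target);
      // copy only the view of interest, including its sofa data
      copier.copyCasView(source.getView("SystemView"), true);
      try (OutputStream out = new FileOutputStream(outFile)) {
        XmiCasSerializer.serialize(target, out);
      }
    }
  }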

Eddie

On Wed, Dec 3, 2014 at 9:04 AM, Jakob Sahlström jakob.sahlst...@gmail.com
wrote:

 Hi,

 I'm dealing with a CAS with multiple views, namely a Gold View and a System
 View. I've been using XmiCasSerializer to save CASes but now I would like
 to save only the contents in the System View. The XmiCasSerializer seems to
 save the whole CAS with all views. Is there an easy way of just saving a
 single view to an xmi file?

 Best,

 Jakob



Re: DUCC doesn't use all available machines

2014-11-30 Thread Eddie Epstein
On Sat, Nov 29, 2014 at 4:46 PM, Simon Hafner reactorm...@gmail.com wrote:

 I've thrown some numbers at it (doubling each) and it's running at
 comfortable 125 procs. However, at about 6.1k of 6.5k items, the procs
 drop down to 30.


125 processes at 8 threads each = 1000 active pipelines. How many CPU cores
are these 1000 pipelines running on?


Re: Ducc: Rename failed

2014-11-28 Thread Eddie Epstein
To debug, please add the following option to the job submission:
--all_in_one local

This will run all the code in a single process on the machine doing the
submit. Hopefully the log file and/or console will be more informative.
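
For example (adjusting the specification file to match the failing job):

  ducc_submit -f DuccRawTextSpec.job --all_in_one local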

On Fri, Nov 28, 2014 at 1:41 PM, Simon Hafner reactorm...@gmail.com wrote:

 2014-11-28 10:45 GMT-06:00 Eddie Epstein eaepst...@gmail.com:
  DuccCasCC component has presumably created
  /home/ducc/analysis/txt.processed/5911.txt_0_processed.zip_temp and
 written
  to it?
 I don't know, the _temp file doesn't exist anymore.

  Did you run this sample job in something other than cluster mode?
 I get the same error running on a single machine.



Re: DUCC doesn't use all available machines

2014-11-28 Thread Eddie Epstein
Now you are hitting a limit configured in ducc.properties:

  # Max number of work-item CASes for each job
  ducc.threads.limit = 500

62 job processes * 8 threads per process = 496 max concurrent work items.
This was put in to limit the memory required by the job driver. This value
can probably be pushed up in the range of 700-800 before the job driver
will go OOM. There are configuration parameters to increase JD memory:

  # Memory size in MB allocated for each JD
  ducc.jd.share.quantum = 450
  # JD max heap size. Should be smaller than the JD share quantum
  ducc.driver.jvm.args = -Xmx400M -DUimaAsCasTracking

DUCC would have to be restarted for the JD size parameters to take effect.

One of the current DUCC development items is to significantly reduce the
memory needed per work item, and raise the default limit for concurrent
work items by two or three orders of magnitude.



On Fri, Nov 28, 2014 at 6:40 PM, Simon Hafner reactorm...@gmail.com wrote:

 I've put the fudge to 12000, and it jumped immediately to 62 procs.
 However, it doesn't spawn new ones even though it has about 6k items
 left and it doesn't spawn more procs.

 2014-11-17 15:30 GMT-06:00 Jim Challenger chall...@gmail.com:
  It is also possible that RM prediction has decided that additional
  processes are not needed.  It
  appears that there were likely 64 work items dispatched, plus the 6
  completed, leaving only
  30 that were idle.  If these work items appeared to be completing
 quickly,
  the RM would decide
  that scale-up would be wasteful and not do it.
 
  Very gory details if you're interested:
  The time to start a new processes is measured by the RM based on the
  observed initialization time of the processes plus an estimate of how
 long
  it would take to get
  a new process actually running.  A fudge-factor is added on top of this
  because in a large operation
  it is wasteful to start processes (with associated preemptions) that only
  end up doing a few work
  tems.  All is subjective and configurable.
 
  The average time-per-work item is also reported to the RM.
 
  The RM then looks at the number of work items remaining, and the
 estimated
  time needed to
  processes this work based on the above, and if it determines that the job
  will be completed before
  new processes can be scaled up and initialized, it does not scale up.
 
  For short jobs, this can be a bit inaccurate, but those jobs are short :)
 
  For longer jobs, the time-per-work-item becomes increasingly accurate so
 the
  RM prediction tends
  to improve and ramp-up WILL occur if the work-item time turns out to be
  larger than originally
  thought.  (Our experience is that work-item times are mostly uniform with
  occasional outliers, but
  the prediction seems to work well).
 
  Relevant configuration parameters in ducc.properties:
  # Predict when a job will end and avoid expanding if not needed. Set to
  false to disable prediction.
 ducc.rm.prediction = true
  # Add this fudge factor (milliseconds) to the expansion target when using
  prediction
 ducc.rm.prediction.fudge = 12
 
  You can observe this in the rm log, see the example below.  I'm
 preparing a
  guide to this log; for now,
  the net of these two log lines is: the projection for the job in question
  (job 208927) is that 16 processes
  are needed to complete this job, even though the job could use 20
 processes
  at full expanseion - the BaseCap -
  so a max of 16 will be scheduled for it,  subject to fair-share
 constraint.
 
  17 Nov 2014 15:07:38,880  INFO RM.RmJob - getPrjCap 208927 bobuser O 2
  T 343171 NTh 128 TI 143171 TR 6748.601431980907 R 1.8967e-02 QR 5043 P 6509
  F 0 ST 1416254363603 return 16
  17 Nov 2014 15:07:38,880  INFO RM.RmJob - initJobCap 208927 bobuser O 2
  Base cap: 20 Expected future cap: 16 potential cap 16 actual cap 16
 
  Jim
 
 
  On 11/17/14, 3:44 PM, Eddie Epstein wrote:
 
  DuccRawTextSpec.job specifies that each job process (JP)
  run 8 analytic pipeline threads. So for this job with 100 work
  items, no more than 13 JPs would ever be started.
 
  After successful initialization of the first JP, DUCC begins scaling
  up the number of JPs using doubling. During JP scale up the
  scheduler monitors the work item completion rate, compares that
  with the JP initialization time, and stops scaling up JPs when
  starting more JPs will not make the job run any faster.
 
  Of course JP scale up is also limited by the job's fair share
  of resources relative to total resources available for all preemptable
  jobs.
 
  To see more JPs, increase the number and/or size of the input text
 files,
  or decrease the number of pipeline threads per JP.
 
  Note that it can be counter productive to run too many pipeline
  threads per machine. Assuming analytic threads are 100% CPU bound,
  running more threads than real cores will often slow down the overall
  document processing rate.
 
 
  On Mon, Nov 17, 2014 at 6:48 AM, Simon Hafner

Re: DUCC org.apache.uima.util.InvalidXMLException and no logs

2014-11-27 Thread Eddie Epstein
Those are the only two log files? Should be a ducc.log (probably with no
more info than on the console), and either one or both of the job driver
logfiles: jd.out.log and jobid-JD-jdnode-jdpid.log. If for some reason the
job driver failed to start, check the job driver agent log (the agent
managing the System/JobDriver reservation) for more info on what happened.

On Wed, Nov 26, 2014 at 10:06 PM, Simon Hafner reactorm...@gmail.com
wrote:

 When launching the Raw Text example application, it doesn't load with
 the following error:

 [ducc@ip-10-0-0-164 analysis]$ MyAppDir=$PWD MyInputDir=$PWD/txt
 MyOutputDir=$PWD/txt.processed ~/ducc_install/bin/ducc_submit -f
 DuccRawTextSpec.job
 Job 50 submitted
 id:50 location:5991@ip-10-0-0-164
 id:50 state:WaitingForDriver
 id:50 state:Completing total:-1 done:0 error:0 retry:0 procs:0
 id:50 state:Completed total:-1 done:0 error:0 retry:0 procs:0
 id:50 rationale:job driver exception occurred:
 org.apache.uima.util.InvalidXMLException at

 org.apache.uima.ducc.common.uima.UimaUtils.getXMLInputSource(UimaUtils.java:246)

 However, there are no logs with a stacktrace or similar, how do I get
 hold of one? The only files in the log directory are:

 [ducc@ip-10-0-0-164 analysis]$ cat logs/50/specified-by-user.properties
 #Thu Nov 27 03:00:57 UTC 2014
 working_directory=/home/ducc/analysis
 process_descriptor_CM=org.apache.uima.ducc.sampleapps.DuccTextCM
 driver_descriptor_CR=org.apache.uima.ducc.sampleapps.DuccJobTextCR
 cancel_on_interrupt=
 process_descriptor_CC_overrides=UseBinaryCompression\=true
 process_descriptor_CC=org.apache.uima.ducc.sampleapps.DuccCasCC
 log_directory=/home/ducc/analysis/logs
 wait_for_completion=
 classpath=/home/ducc/analysis/lib/*
 process_thread_count=8
 driver_descriptor_CR_overrides=BlockSize\=10 SendToLast\=true
 InputDirectory\=/home/ducc/analysis/txt
 OutputDirectory\=/home/ducc/analysis/txt.processed
 process_per_item_time_max=20

 process_descriptor_AE=/home/ducc/analysis/opennlp.uima.OpenNlpTextAnalyzer/opennlp.uima.OpenNlpTextAnalyzer_pear.xml
 description=DUCC raw text sample application
 process_jvm_args=-Xmx3G -XX\:+UseCompressedOops

 -Djava.util.logging.config.file\=/home/ducc/analysis/ConsoleLogger.properties
 scheduling_class=normal
 process_memory_size=4
 specification=DuccRawTextSpec.job

 [ducc@ip-10-0-0-164 analysis]$ cat logs/50/job-specification.properties
 #Thu Nov 27 03:00:57 UTC 2014
 working_directory=/home/ducc/analysis
 process_descriptor_CM=org.apache.uima.ducc.sampleapps.DuccTextCM
 process_failures_limit=20
 driver_descriptor_CR=org.apache.uima.ducc.sampleapps.DuccJobTextCR
 cancel_on_interrupt=
 process_descriptor_CC_overrides=UseBinaryCompression\=true
 process_descriptor_CC=org.apache.uima.ducc.sampleapps.DuccCasCC
 classpath_order=ducc-before-user
 log_directory=/home/ducc/analysis/logs
 submitter_pid_at_host=5991@ip-10-0-0-164
 wait_for_completion=
 classpath=/home/ducc/analysis/lib/*
 process_thread_count=8
 driver_descriptor_CR_overrides=BlockSize\=10 SendToLast\=true
 InputDirectory\=/home/ducc/analysis/txt
 OutputDirectory\=/home/ducc/analysis/txt.processed
 process_initialization_failures_cap=99
 process_per_item_time_max=20

 process_descriptor_AE=/home/ducc/analysis/opennlp.uima.OpenNlpTextAnalyzer/opennlp.uima.OpenNlpTextAnalyzer_pear.xml
 description=DUCC raw text sample application
 process_jvm_args=-Xmx3G -XX\:+UseCompressedOops

 -Djava.util.logging.config.file\=/home/ducc/analysis/ConsoleLogger.properties
 scheduling_class=normal
 environment=HOME\=/home/ducc LANG\=en_US.UTF-8 USER\=ducc
 process_memory_size=4
 user=ducc
 specification=DuccRawTextSpec.job



Re: DUCC web server interfacing

2014-11-21 Thread Eddie Epstein
On Thu, Nov 20, 2014 at 10:01 PM, D. Heinze dhei...@gnoetics.com wrote:

 Eddie... thanks.  Yes, that sounds like I would not have the advantage of
 DUCC managing the UIMA pipeline.


Depends on the definition of managing. DUCC manages the lifecycle of
analytic pipelines running as job processes and as services. There are
differences in how DUCC decides how many instances of each are run. And you
are right that only for jobs will DUCC send work items to the analytic
pipeline.



 To break it down a little for the uninitiated (me),

  1. how do I start a DUCC job that stays resident because it has high
 startup cost (e.g. 2 minutes to load all the resources for the UIMA
 pipeline VS about 2 seconds to process each request)?


Run the pipeline as a service. A service can be configured to start
automatically, as soon as DUCC starts. If the load on the service
increases, DUCC can be told [manually or programmatically] to launch
additional service instances.


 2. once I have a resident job, how do I get the Job Driver to iteratively
 feed references to each next document (as they are received) to the
 resident Job Process?  Because all the input jobs will be archived anyhow,
 I'm okay with passing them through the file system if needed.


The easiest approach is to have an application driver, say a web service,
directly feed input to the service. If using references as input, the same
analytic pipeline could be used both for live processing as a service and
for batch job processing.

DUCC jobs are designed for batch work, where the size of the input
collection is known and the number of job processes will be replicated as
much as possible, given available resources and the job's fair share when
multiple jobs are running.

DUCC services are intended to support job pipelines, for example a large
memory but low latency analytic that can be shared by many job process
instances, or for interactive applications.

Have you looked at creating a UIMA-AS service from a UIMA pipeline?

Eddie


Re: DUCC web server interfacing

2014-11-20 Thread Eddie Epstein
The preferred approach is to run the analytics as a DUCC service, and have
an application driver that feeds the service instances with incoming data.
This service would be a scalable UIMA-AS service, which could have as
many instances as are needed to keep up with the load. The driver would
use the uima-as client API to feed the service. The application driver
could
itself be another DUCC service.

DUCC manages the life cycle of its services, including restarting them on
failure.
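
A rough sketch of such a driver using the uima-as client API (broker URL,
queue name and CAS pool size below are placeholders):

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.uima.aae.client.UimaAsynchronousEngine;
  import org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngine_impl;
  import org.apache.uima.cas.CAS;

  public class ServiceDriver {
    public static void main(String[] args) throws Exception {
      UimaAsynchronousEngine client = new BaseUIMAAsynchronousEngine_impl();
      Map<String, Object> appCtx = new HashMap<String, Object>();
      appCtx.put(UimaAsynchronousEngine.ServerUri, "tcp://broker-host:61616");
      appCtx.put(UimaAsynchronousEngine.ENDPOINT, "MyAnalysisQueue");
      appCtx.put(UimaAsynchronousEngine.CasPoolSize, 4);
      client.initialize(appCtx);

      // one request; a real driver would loop over incoming documents
      CAS cas = client.getCAS();
      cas.setDocumentText("text, or a reference to the artifact to analyze");
      client.sendAndReceiveCAS(cas);  // synchronous; use sendCAS() plus a callback listener for async
      // read results from 'cas' here, then return it to the client's CAS pool
      cas.release();

      client.stop();
    }
  }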

Eddie


On Thu, Nov 20, 2014 at 6:45 PM, Daniel Heinze dhei...@san.rr.com wrote:

 I just installed DUCC this week and can process batch jobs.  I would like
 DUCC to initiate/manage one or more copies of the same UIMA pipeline that
 has high startup overhead and keep it/them active and feed it/them with
 documents that arrive periodically over a web service.  Any suggestions on
 the preferred way (if any) to do this in DUCC.



 Thanks / Dan




Re: DUCC web server interfacing

2014-11-20 Thread Eddie Epstein
Ooops, in this case the web server would be feeding the service directly.

On Thu, Nov 20, 2014 at 9:04 PM, Eddie Epstein eaepst...@gmail.com wrote:

 The preferred approach is to run the analytics as a DUCC service, and have
 an application driver that feeds the service instances with incoming data.
 This service would be a scalable UIMA-AS service, which could have as
 many instances as are needed to keep up with the load. The driver would
 use the uima-as client API to feed the service. The application driver
 could
 itself be another DUCC service.

 DUCC manages the life cycle of its services, including restarting them on
 failure.

 Eddie


 On Thu, Nov 20, 2014 at 6:45 PM, Daniel Heinze dhei...@san.rr.com wrote:

 I just installed DUCC this week and can process batch jobs.  I would like
 DUCC to initiate/manage one or more copies of the same UIMA pipeline that
 has high startup overhead and keep it/them active and feed it/them with
 documents that arrive periodically over a web service.  Any suggestions on
 the preferred way (if any) to do this in DUCC.



 Thanks / Dan





Re: DUCC-Un-managed Reservation??

2014-11-18 Thread Eddie Epstein
On Tue, Nov 18, 2014 at 1:05 AM, reshu.agarwal reshu.agar...@orkash.com
wrote:


 Hi,

 I am bit confused. Why we need un-managed reservation? Suppose we give 5GB
 Memory size to this reservation. Can this RAM be consumed by any process if
 required?


Basically yes. See more info about Rogue Process in the duccbook.



 In my scenario,  when all RAMs of Nodes was consumed by JOBs, all
 processes went in waiting state. I need some reservation of RAMs for this
 so that it can not be consumed by shares for Job Processes but if required
 internally it could be used.


Any idea why all processes went into waiting state? Did the job details page
show that these JPs were assigned work items?



 Can un-managed reservation be used for this?

 Thanks in advanced.

 Reshu.





Re: DUCC stuck at WaitingForResources on an Amazon Linux

2014-11-15 Thread Eddie Epstein
On Fri, Nov 14, 2014 at 8:11 PM, Simon Hafner reactorm...@gmail.com wrote:

 So to run effectively, I would need more memory, because the job wants
 two shares? ... Yes. With a larger node it works. What would be a
 reasonable memory size for a ducc node?

 Really depends on the application code. Quoting from the DUCC overview at
http://uima.apache.org/doc-uimaducc-whatitam.html

  DUCC is particularly well suited to run large memory Java analytics in
   multiple threads in order to fully utilize multicore machines.

Our experience to date has been with machines 16-256GB and 4-32 CPU cores.
Smaller machines, 8GB or less, have only been used for development of DUCC
itself, with dummy analytics that use little memory and CPU.


Re: DUCC stuck at WaitingForResources on an Amazon Linux

2014-11-13 Thread Eddie Epstein
Simon,

The DUCC resource manager logs into rm.log. Did you look there for reasons
the resources are not being allocated?

Eddie

On Wed, Nov 12, 2014 at 4:07 PM, Simon Hafner reactorm...@gmail.com wrote:

 4 shares total, 2 in use.

 2014-11-12 5:06 GMT-06:00 Lou DeGenaro lou.degen...@gmail.com:
  Try looking at your DUCC's web server.  On the System - Machines page
  do you see any shares not inuse?
 
  Lou.
 
  On Wed, Nov 12, 2014 at 5:51 AM, Simon Hafner reactorm...@gmail.com
 wrote:
  I've set up DUCC according to
  https://cwiki.apache.org/confluence/display/UIMA/DUCC
 
  ducc_install/bin/ducc_submit -f ducc_install/examples/simple/1.job
 
  the job is stuck at WaitingForResources.
 
  12 Nov 2014 10:37:30,175  INFO Agent.LinuxNodeMetricsProcessor -
  process N/A ... Agent Collecting User Processes
  12 Nov 2014 10:37:30,176  INFO Agent.NodeAgent -
  copyAllUserReservations N/A +++ Copying User Reservations
  - List Size:0
  12 Nov 2014 10:37:30,176  INFO Agent.LinuxNodeMetricsProcessor - call
 N/A ** User Process Map Size After
  copyAllUserReservations:0
  12 Nov 2014 10:37:30,176  INFO Agent.LinuxNodeMetricsProcessor - call
 N/A ** User Process Map Size After
  copyAllUserRougeProcesses:0
  12 Nov 2014 10:37:30,182  INFO Agent.LinuxNodeMetricsProcessor - call
N/A
  12 Nov 2014 10:37:30,182  INFO Agent.LinuxNodeMetricsProcessor - call
 N/A
 **
  12 Nov 2014 10:37:30,182  INFO Agent.LinuxNodeMetricsProcessor -
  process N/A ... Agent ip-172-31-7-237.us-west-2.compute.internal
  Posting Memory:4050676 Memory Free:4013752 Swap Total:0 Swap Free:0
  Low Swap Threshold Defined in ducc.properties:0
  12 Nov 2014 10:37:33,303  INFO Agent.AgentEventListener -
  reportIncomingStateForThisNode N/A Received OR Sequence:699 Thread
  ID:13
  12 Nov 2014 10:37:33,303  INFO Agent.AgentEventListener -
  reportIncomingStateForThisNode N/A
  JD-- JobId:6 ProcessId:0 PID:8168 Status:Running Resource
  State:Allocated isDeallocated:false
  12 Nov 2014 10:37:33,303  INFO Agent.NodeAgent - setReservations
  N/A +++ Copied User Reservations - List Size:0
  12 Nov 2014 10:37:33,405  INFO
  Agent.AgentConfiguration$$EnhancerByCGLIB$$cc49880b - getSwapUsage-
   N/A PID:8168 Swap Usage:0
  12 Nov 2014 10:37:33,913  INFO
  Agent.AgentConfiguration$$EnhancerByCGLIB$$cc49880b -
  collectProcessCurrentCPU N/A 0.0 == CPUTIME:0.0
  12 Nov 2014 10:37:33,913  INFO
  Agent.AgentConfiguration$$EnhancerByCGLIB$$cc49880b - process N/A
  --- PID:8168 Major Faults:0 Process Swap Usage:0 Max Swap
  Usage Allowed:-108574720 Time to Collect Swap Usage:0
 
  I'm using a t2.medium instance (2 CPU, ~ 4GB RAM) and the stock Amazon
  Linux (looks centos based).
 
  To install maven (not in the repos)
 
  #! /bin/bash
 
  TEMPORARY_DIRECTORY=$(mktemp -d)
  DOWNLOAD_TO=$TEMPORARY_DIRECTORY/maven.tgz
 
  echo 'Downloading Maven to: ' $DOWNLOAD_TO
 
  wget -O $DOWNLOAD_TO
 
 http://www.eng.lsu.edu/mirrors/apache/maven/maven-3/3.2.3/binaries/apache-maven-3.2.3-bin.tar.gz
 
  echo 'Extracting Maven'
  tar xzf $DOWNLOAD_TO -C $TEMPORARY_DIRECTORY
  rm $DOWNLOAD_TO
 
  echo 'Configuring Envrionment'
 
  mv $TEMPORARY_DIRECTORY/apache-maven-* /usr/local/maven
  echo -e 'export M2_HOME=/usr/local/maven\nexport
  PATH=${M2_HOME}/bin:${PATH}'  /etc/profile.d/maven.sh
  source /etc/profile.d/maven.sh
 
  echo 'The maven version: ' `mvn -version` ' has been installed.'
  echo -e '\n\n!! Note you must relogin to get mvn in your path !!'
  echo 'Removing the temporary directory...'
  rm -r $TEMPORARY_DIRECTORY
  echo 'Your Maven Installation is Complete.'



Re: UIMA DUCC - Multi-machine Installation

2014-11-05 Thread Eddie Epstein
Hi,

There is a default limit of 500 work items dispatched at the same time. How
many dispatched work items are shown for the job?

Eddie


On Wed, Nov 5, 2014 at 3:11 AM, Thanh Tam Nguyen nthanh...@gmail.com
wrote:

 Hi Eddie,
 Thanks for your email. I followed the documentation and I was able to run
 DUCC jobs using different user instead of user ducc. But while I was
 watching the webserver, I only found one machine running the jobs. In the
 tab SystemMachines, I can see all the machine statuses are up. What
 should I do to run the jobs on all machines?


 Regards,
 Tam

 On Fri, Oct 31, 2014 at 9:37 PM, Eddie Epstein eaepst...@gmail.com
 wrote:

  Hi Tam,
 
  In the install documentation,
  http://uima.apache.org/d/uima-ducc-1.0.0/installation.html,
  the section Multi-User Installation and Verification describes how to
  configure setuid-root
  for ducc_ling so that DUCC jobs are run as the submitting user instead of
  user ducc.
 
  The setuid-root ducc_ling should be put on every DUCC node, in the same
  place,
  and ducc.properties updated to point at that location.
 
  Eddie
 
 
  On Fri, Oct 31, 2014 at 3:54 AM, Thanh Tam Nguyen nthanh...@gmail.com
  wrote:
 
   Hi Eddie,
   Would you tell me more details how to setup DUCC for multiuser mode?
  FYI, I
   have successfully setup and ran my UIMA analysis engine on single user
   mode. I also followed DUCCBOOK to setup ducc_ling but I am sure how to
  get
   it worked on a cluster of machines.
  
   Thanks,
   Tam
  
   On Thu, Oct 30, 2014 at 11:08 PM, Eddie Epstein eaepst...@gmail.com
   wrote:
  
The $DUCC_RUNTIME tree needs to be on a shared filesystem accessible
  from
all machines.
For single user mode ducc_ling could be referenced from there as
 well.
But for multiuser setup, ducc_ling needs setuid and should be
 installed
   on
the root drive.
   
Eddie
   
On Thu, Oct 30, 2014 at 10:08 AM, James Baker 
 james.d.ba...@gmail.com
  
wrote:
   
 I've been working through the installation of UIMA DUCC, and have
 successfully got it set up and running on a single machine. I'd now
   like
to
 move to running it on a cluster of machines, but it isn't clear to
 me
from
 the installation guide as to whether I need to install DUCC on each
   node,
 or whether ducc_ling is the only thing that needs installing on the
 non-head nodes.

 Could anyone shed some light on the process please?

 Thanks,
 James

   
  
 



Re: UIMA DUCC - Multi-machine Installation

2014-10-31 Thread Eddie Epstein
Hi Tam,

In the install documentation,
http://uima.apache.org/d/uima-ducc-1.0.0/installation.html,
the section Multi-User Installation and Verification describes how to
configure setuid-root
for ducc_ling so that DUCC jobs are run as the submitting user instead of
user ducc.

The setuid-root ducc_ling should be put on every DUCC node, in the same
place,
and ducc.properties updated to point at that location.

Eddie


On Fri, Oct 31, 2014 at 3:54 AM, Thanh Tam Nguyen nthanh...@gmail.com
wrote:

 Hi Eddie,
 Would you tell me more details how to setup DUCC for multiuser mode? FYI, I
 have successfully setup and ran my UIMA analysis engine on single user
 mode. I also followed DUCCBOOK to setup ducc_ling but I am not sure how to get
 it working on a cluster of machines.

 Thanks,
 Tam

 On Thu, Oct 30, 2014 at 11:08 PM, Eddie Epstein eaepst...@gmail.com
 wrote:

  The $DUCC_RUNTIME tree needs to be on a shared filesystem accessible from
  all machines.
  For single user mode ducc_ling could be referenced from there as well.
  But for multiuser setup, ducc_ling needs setuid and should be installed
 on
  the root drive.
 
  Eddie
 
  On Thu, Oct 30, 2014 at 10:08 AM, James Baker james.d.ba...@gmail.com
  wrote:
 
   I've been working through the installation of UIMA DUCC, and have
   successfully got it set up and running on a single machine. I'd now
 like
  to
   move to running it on a cluster of machines, but it isn't clear to me
  from
   the installation guide as to whether I need to install DUCC on each
 node,
   or whether ducc_ling is the only thing that needs installing on the
   non-head nodes.
  
   Could anyone shed some light on the process please?
  
   Thanks,
   James
  
 



Re: UIMA DUCC - Multi-machine Installation

2014-10-30 Thread Eddie Epstein
The $DUCC_RUNTIME tree needs to be on a shared filesystem accessible from
all machines.
For single user mode ducc_ling could be referenced from there as well.
But for multiuser setup, ducc_ling needs setuid and should be installed on
the root drive.

Eddie

On Thu, Oct 30, 2014 at 10:08 AM, James Baker james.d.ba...@gmail.com
wrote:

 I've been working through the installation of UIMA DUCC, and have
 successfully got it set up and running on a single machine. I'd now like to
 move to running it on a cluster of machines, but it isn't clear to me from
 the installation guide as to whether I need to install DUCC on each node,
 or whether ducc_ling is the only thing that needs installing on the
 non-head nodes.

 Could anyone shed some light on the process please?

 Thanks,
 James



Re: Could UIMA AS client send custom key value parameters to annotator?

2014-10-01 Thread Eddie Epstein
There is no mechanism for a uima-as client to modify the result
specification of a remote service. Since type/feature control cannot
indicate many other behavioral characteristics, like speed vs accuracy
tradeoffs, the suggested approach for dynamic control is to use dedicated
feature structures in the submitted CAS.
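
A sketch of that idea on the client side (the control type
org.example.ControlParams and its wantedTypes feature are made-up names that
would have to be part of the service's type system, and the remote annotator
would read this FS in its process() method):

  import org.apache.uima.cas.CAS;
  import org.apache.uima.cas.Feature;
  import org.apache.uima.cas.FeatureStructure;
  import org.apache.uima.cas.Type;

  public class ControlledRequest {
    // attach the requested output types to the CAS before sending it
    public static void addControlInfo(CAS cas, String commaSeparatedTypes) {
      Type ctrlType = cas.getTypeSystem().getType("org.example.ControlParams");
      Feature wanted = ctrlType.getFeatureByBaseName("wantedTypes");
      FeatureStructure ctrl = cas.createFS(ctrlType);
      ctrl.setStringValue(wanted, commaSeparatedTypes);
      cas.addFsToIndexes(ctrl);
    }
  }

The client would call addControlInfo(cas, ...) just before sendCAS(cas), and a
RegExAnnotator-like service would then run only the patterns whose output
types were requested.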

Eddie

On Tue, Sep 30, 2014 at 11:13 PM, jeffery yuan yuanyu...@gmail.com wrote:

 Hi, Dear UIMA Developer and Users:

 Thanks advance for any help.

 I am using UIMA AS and the RegExAnnotator.
 I am wondering whether the client can send some custom key value
 parameters to
 the annotator.

 The real function I want to implement is:
 --
 As there are multiple(10+) regex defined in the regex.pear, client may be
 only
 interested in several entity types, so we want RegExAnnotator only run
 regex
 for types that client specifies.

 If I use synchronous UIMA API, I can set ResultSpecification in client.
 AnalysisEngine ae;
 ResultSpecification rs = createResultSpecification(types);
 ae.process(cas, rsf);

 RegExAnnotator then check
 getResultSpecification().getResultTypesAndFeatures(), then only run needed
 regex.

 But as I use UIMA AS(ae.sendCAS(cas);), there is no API to set
 ResultSpecification or specify custom key value parameters.

 Thanks...




Re: Uima AS out of memory

2014-08-20 Thread Eddie Epstein
When using deployAsyncService.sh to start a UIMA AS service, the default
Java heap size is Xmx800M. To override this, export an environment
parameter UIMA_JVM_OPTS with JVM arguments. For example:
   $ export UIMA_JVM_OPTS="-Xmx6G -Xms2G"
   $ deployAsyncService.sh myDeploymentDescriptor.xml



On Wed, Aug 20, 2014 at 2:04 AM, Swirl lriwsw...@gmail.com wrote:

 Hi,
 I have deployed a AE onto a Uima AS node.
 But when I use it to analyse some documents, i got OutOfMemoryError: Java
 heap
 space.
 I know that the AE is taking large amount of memory due to it loading many
 resources.

 How can I increase the memory allocated to it in the Uima AS so that I can
 avoid the error?

 Thanks.




Re: UIMA AS NullPointerException in CasDefinition constructor

2014-08-04 Thread Eddie Epstein
Hi,

Very good. The Deploy_MeetingDetectorTAE.xml does not have a CR in the
aggregate either. It is designed to be used with a driver as you have done,
or by plugging a CR into the UimaAsynchronousEngine as is done in the
README example. This approach is scalable by increasing the number of
processing threads in the deployment descriptor, and by launching
additional instances of the service.

Another example that plugs a CR directly into an AE, also referenced in
README, is
 $UIMA_HOME/examples/deploy/as/MeetingFinderAggregate.xml
 $UIMA_HOME/examples/deploy/as/Deploy_MeetingFinder.xml
This approach fits better the scaling model used by DUCC, where individual
services have a CasMultiplier at the front of each service.

Eddie





On Mon, Aug 4, 2014 at 10:52 AM, Egbert van der Wal e...@pointpro.nl
wrote:

 Just to post an update on my own message:

 In the full version, I was able to fix it by not adding the Collection
 Reader to
 the AsychronousEngine but just initializing it without and afterwards doing
 it manually:

 reader.initialize();
 while (reader.hasNext())
   ae.sendCAS(reader.getNext(ae.getCAS()));

 instead of just calling:

 ae.process();

 makes it work as it should. So it seems that in my situation, there is a
 problem with adding the CollectionReader to the engine. Not such a big
 deal as my new approach works, but it would be nice to use the
 CollectionReader as it's meant to be used.

 Thanks for any suggestions!

 Regards,

 Egbert


 On Monday, August 04, 2014 02:07:18 PM you wrote:
  Hi Eddie,
 
  Thanks for the suggestion. I have limited time to work on this so my
  response may be slow now and again, but I'm still working on it. Your
 input
  is very much appreciated!
 
  First of all, the command:
 
  runRemoteAsyncAE.sh/cmd tcp://localhost:61616
  MeetingDetectorTaeQueue \
   -d Deploy_MeetingDetectorTAE.xml \
   -c
 
 $UIMA_HOME/examples/descriptors/collection_reader/FileSystemCollectionReader.xml
 
  completes without any errors, so the example seems to be running fine.
 
  I've been fiddling around with it and I was able to pinpoint the problem
 to
  the collection reader. I have a collection reader (subclassed from
  CollectionReader_ImplBase) that is added to the AsynchronousEngine.
  However, it does not seem to be the specific implementation.
 
  I've reduced all code / annotators etc to a very basic set that still
 shows
  the problem, where the AE only has one (very basic) annotator and the
  Collection Reader actually doesn't do anything except stating that
 there's
  nothing left to do.
 
  I'll attach the Minimum Working Examples (or actually, not-working
  examples) that I've constructed to pinpoint the problem.
 
  Commenting out line 62 (the call to setCollectionReader) 'fixes' the
  problem. However, as you can see, the implementation of MWEReader
  doesn't do anything at all, so I don't really see why it would cause this
  trouble.
 
  Any ideas on what is causing this problem?
 
  Thanks,
 
  Egbert
 
  On Monday, July 28, 2014 05:14:39 PM Eddie Epstein wrote:
   Hi Egbert,
  
   The README file for UIMA-AS shows an application example with
   Deploy_MeetingDetectorTAE.xml. Does that run OK for you?
  
   Assuming yes, can you give more details about the scenario, perhaps
 
  the
 
   explicit commands used? The descriptors used?
  
   Eddie
  
  
  
   On Mon, Jul 28, 2014 at 11:46 AM, Egbert van der Wal
 
  e...@pointpro.nl
 
   wrote:
 Hi,
   
I'm trying to convert an existing and functional UIMA pipeline to a
 UIMA
AS pipeline.
   
   
   
I'm getting there, I created deployment descriptors for the
 annotators
 
  and
 
when running my application all individual annotators are launched
correctly. The composite analysis engine also loads fine but I'm
 getting
 
  a
 
NullPointerException when calling initialize(deployCtx) on the
UimaAsEngine
on line 66. See the attached text document for the full exception.
   
   
   
   
   
I found a similar issue in the bug tracker which was fixed in UIMA AS
 
  2.3:
https://issues.apache.org/jira/browse/UIMA-1376
   
   
   
But this seems to arise in mergeTypeSystem and this does not seem
 
  to be
 
the case in my situation. The line number is the same however.
   
   
   



Re: Building UIMA-CPP on (K)Ubuntu 14.04

2014-07-29 Thread Eddie Epstein
Issues (bugs, feature requests, ...) should be added to the Apache JIRA
system,
https://issues.apache.org/jira/browse/uima

Project=UIMA, component=C++ Framework
https://issues.apache.org/jira/browse/UIMA/component/12311616

Thanks,
Eddie


On Mon, Jul 28, 2014 at 4:30 AM, Egbert van der Wal e...@pointpro.nl
wrote:

 Thanks again for the suggestion.

 As for the rest of your response: the main reason I was asking is that I
 have not
 upgraded my UIMA to 2.4.2 or even 2.6.0 because there is no release of
 uimacpp for
 those versions. If I understand your response correctly, it seems that
 even though
 uimacpp is versioned at 2.4.0 it should work just as well is with uimaj-as
 2.6.0?

 Of course, at least one feature request would be to keep the dependencies
 up to date,
 for example, the dependency on APR  1.5. While the change of the
 build-script isn't
 that hard, it does make it harder to have newcomers to uimacpp adopt it.

 Another feature request: make it possible to load annotators from shared
 libraries
 without having to change the LD_LIBRARY_PATH: e.g., make it possible to
 specify a full
 / relative path in the annotator descriptors instead of just a library
 name. Where should
 I report these feature requests?

 Thanks again!

 Egbert


 On Tuesday, July 22, 2014 05:04:36 PM Eddie Epstein wrote:
  Good to hear the build worked. UIMACPP implements only a core subset of
  UIMA functionality, basically the CAS API and the ability to create
  primitive and aggregate analysis engines. There are also two alternative
  methods to integrate native code wrapped with uimacpp with a uimaj
 pipeline:
  a JNI interface and a JMS service interface. I am not aware of any
 changes
  to the CAS API or to UIMA-AS that stop uimacpp from working, nor of any
 new
  feature requests for uimacpp.
 
  Regards,
  Eddie
 
  On Tue, Jul 22, 2014 at 3:38 PM, Egbert van der Wal e...@pointpro.nl
 
  wrote:
   This helps indeed. A weird hack, but at least the configure script
   completes
   and the build succeeds. Thanks a lot!
  
   I noticed there haven't been any commits to the uimacpp repository in
   2012. It
   is indeed the case that there is noone actively working on it or has
   development been moved to a different repository?
  
   I have also been able to run the precompiled binaries, but that of
 course
   references the non-system libraries bundled with the package and since
 my
   annotators are also linking against the system ICU library, this will
   probably
   result in conflicts.
  
   Kind regards,
  
   Egbert
  
   Op dinsdag 22 juli 2014 09:30:41 schreef Eddie Epstein:
Looking at a build on RHEL, jni.h was resolved with:
--with-jdk=/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/include
-I/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/include/linux
which follows the instructions in README.4src
   
I had built ICU [and all other dependencies] and installed it in a
  
   private
  
directory by configuring with --prefix command. Changing uimacpp's
configure to use icu-config looks interesting.
   
Hope this helps,
Eddie
   
   
   
   
On Tue, Jul 22, 2014 at 2:51 AM, Egbert van der Wal 
 e...@pointpro.nl
   
wrote:
 Hi,

 I've been trying to add an annotator in C++ to an Annotation
 Engine in
 Java.
 However, building UIMA-CPP is not a trivial task, so it seems.

 So far, I've identified dependencies on ActiveMQ-CPP, APR, a Java
 JRE,
 Xerces
 and ICU. Maybe there's more key dependencies but those do not
 appear
  
   to be
  
 a
 problem on Ubuntu.

 libxerces was easy to fix as a compatible version is in the Ubuntu
 repository.

 APR is harder: Ubuntu 14.04 ships with 1.5.x while the configure
 script
 checks
 for 1.2, 1.3 or 1.4. Hacking the configure-script to also accept
 1.5.x
 works
 but I didn't get to compiling yet so I don't know about the
 API-differences
 and if this will work.

 ActiveMQ-CPP is not in the Ubuntu repository. I had to locate and
 build
 this
 myself, but this actually didn't prove to be so hard.

 ICU is harder. The configure script wants a --with-icu path, but
 then
 assumes
 other facts. In Ubuntu, icu-config is located in /usr/bin while the
  
   header
  
 files are located in /usr/include/x86-64-linux-gnu/unicode/. The
 configure-
 script seems to have problems to recognize this difference. I
 would've



Re: UIMA AS NullPointerException in CasDefinition constructor

2014-07-28 Thread Eddie Epstein
Hi Egbert,

The README file for UIMA-AS shows an application example with
Deploy_MeetingDetectorTAE.xml. Does that run OK for you?

Assuming yes, can you give more details about the scenario, perhaps the
explicit commands used? The descriptors used?

Eddie



On Mon, Jul 28, 2014 at 11:46 AM, Egbert van der Wal e...@pointpro.nl
wrote:

  Hi,



 I'm trying to convert an existing and functional UIMA pipeline to a UIMA
 AS pipeline.



 I'm getting there: I created deployment descriptors for the annotators and
 when running my application all individual annotators are launched
 correctly. The composite analysis engine also loads fine but I'm getting a
 NullPointerException when calling initialize(deployCtx) on the UimaAsEngine
 on line 66. See the attached text document for the full exception.





 I found a similar issue in the bug tracker which was fixed in UIMA AS 2.3:



 https://issues.apache.org/jira/browse/UIMA-1376



 But that exception arises in mergeTypeSystem, which does not seem to be
 the case in my situation. The line number is the same, however.



 Any clues on where I should look for the solution? Are my descriptors
 faulty? Is the Java code faulty? Or is this a bug in UIMA AS 2.4.0? How can
 I debug this issue?



 Thanks,



 Egbert
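
For comparison, a stripped-down UIMA-AS client that only connects to an
already-deployed service looks roughly like the sketch below. The broker URI,
queue name, CAS pool size, and class name are placeholders, not values from
this thread; substitute the ones from your deployment. Reproducing the failure
with a context map this small can help narrow down whether the
NullPointerException comes from the application context or from the
deployment descriptors.

// Stripped-down UIMA-AS client sketch; broker URI and queue name are placeholders.
import java.util.HashMap;
import java.util.Map;

import org.apache.uima.aae.client.UimaAsynchronousEngine;
import org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngine_impl;
import org.apache.uima.cas.CAS;

public class MinimalAsClient {
  public static void main(String[] args) throws Exception {
    UimaAsynchronousEngine engine = new BaseUIMAAsynchronousEngine_impl();

    Map<String, Object> appCtx = new HashMap<String, Object>();
    appCtx.put(UimaAsynchronousEngine.ServerUri, "tcp://localhost:61616"); // placeholder broker
    appCtx.put(UimaAsynchronousEngine.ENDPOINT, "MyAggregateQueue");       // placeholder queue
    appCtx.put(UimaAsynchronousEngine.CasPoolSize, 2);

    // This is the call that fails with the NPE reported above.
    engine.initialize(appCtx);

    CAS cas = engine.getCAS();
    cas.setDocumentText("Sample document text.");
    engine.sendAndReceiveCAS(cas); // synchronous round trip to the service

    engine.collectionProcessingComplete();
    engine.stop();
  }
}

If a bare client like this initializes cleanly against the deployed service,
the next suspect is the aggregate's deployment descriptor and the type system
merge it triggers.
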







Re: Passing additional parameters through to CPE components

2014-07-24 Thread Eddie Epstein
A CPE descriptor can override configuration parameters defined in any of the
integrated components.
The documentation is a little bit below
http://uima.apache.org/d/uimaj-2.6.0/references.html#ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual
3.6.1.2. configurationParameterSettings Element

This element provides a way to override the contained Analysis Engine's
parameters settings. Any entry specified here must already be defined;
values specified replace the corresponding values for each parameter. For
Cas Processors, this mechanism is only available when they are deployed in “
integrated” mode. For Collection Readers and Initializers, it always is
available.



On Thu, Jul 24, 2014 at 8:19 AM, James Baker james.d.ba...@gmail.com
wrote:

 Is it possible to provide additional configuration parameters in a CPE
 descriptor XML file that aren't specified in the annotator/collection
 reader descriptor XML file?

 I have a collection reader that accepts the classname of a class to use to
 do the content extraction as a parameter. This works fine, but I'd like to
 be able to pass additional parameters to the content extractor via the XML.
 The parameters will be dependent on the content extractor though, so I
 can't specify them in the collection reader descriptor. For example,
 ContentExtractor1 might need a parameter 'encoding', and ContentExtractor2
 might need a parameter 'baseUrl'.

 I have been able to achieve this with UimaFIT by creating the collection
 reader without the XML and injecting the parameters, but when I try to do
 it from the XML file the parameters don't make it through to my content
 extractor (I pass the UimaContext object through to the content extractor).
 I suspect they might be being ignored by UIMA because they aren't in the
 descriptor. How can I work around this?

 Thanks,
 James



Re: Passing additional parameters through to CPE components

2014-07-24 Thread Eddie Epstein
Right, the only way for encompassing descriptors (like aggregates or
CPE) to affect configuration parameters is via overrides.

Eddie
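
A workaround that some projects use (not a framework feature; the names here,
such as extractorParams and MyCollectionReader, and the key=value convention
are purely illustrative) is to declare one multi-valued String parameter in
the collection reader descriptor and let the CPE override it with arbitrary
key=value pairs, which the reader then parses and hands to the content
extractor. A rough sketch:

// Sketch of the pass-through workaround; parameter and class names are hypothetical.
import java.util.HashMap;
import java.util.Map;

import org.apache.uima.UimaContext;
import org.apache.uima.collection.CollectionReader_ImplBase;
import org.apache.uima.resource.ResourceInitializationException;

public abstract class MyCollectionReader extends CollectionReader_ImplBase {

  private final Map<String, String> extractorParams = new HashMap<String, String>();

  @Override
  public void initialize() throws ResourceInitializationException {
    super.initialize();
    UimaContext ctx = getUimaContext();

    // "extractorParams" must be declared as a multiValued String parameter in
    // the reader's descriptor so the CPE descriptor is allowed to override it.
    String[] pairs = (String[]) ctx.getConfigParameterValue("extractorParams");
    if (pairs != null) {
      for (String pair : pairs) {
        int eq = pair.indexOf('=');
        if (eq > 0) {
          extractorParams.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
      }
    }
    // The content extractor is then given this plain map instead of the
    // UimaContext, so extractor-specific settings never need their own
    // entries in the reader's descriptor.
  }
}

In MyCpe.xml the override is then just one more entry under
configurationParameterSettings, carrying values such as
baseUrl=http://www.example.com or encoding=UTF-8.
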


On Thu, Jul 24, 2014 at 11:31 AM, james.d.ba...@gmail.com wrote:

 I think you’ve misunderstood my question - I’m not asking whether I can
 override defined parameters, I’m asking if I can provide additional
 configuration parameters that aren’t defined in a descriptor file. Let me
 give an example:

 MyCollectionReader.xml defines the following properties:
 folder [String] - The folder to process files from
 classname [String] - The qualified class name of a class
 implementing my ContentExtractor interface

 MyCpe.xml uses MyCollectionReader.xml and provides the following
 properties, including some that MyContentExtractor uses but aren’t defined
 above:
 folder - /opt/test
 classname - test.MyContentExtractor
 baseUrl - http://www.example.com

 The parameter baseUrl, although it is specified in the MyCpe.xml file,
 isn’t defined in MyCollectionReader.xml because it is specific to the
 MyContentExtractor class and not necessarily known at design time. However,
 UIMA isn’t passing it through to UimaContext presumably because it isn’t
 defined in the MyCollectionReader.xml.

 Hope that helps clear it up.


 On 24 Jul 2014, at 14:51, Eddie Epstein eaepst...@gmail.com wrote:

  A CPE descriptor can override configuration parameters defined in any
  integrated components.
  Documentation a little bit below
 
 http://uima.apache.org/d/uimaj-2.6.0/references.html#ugr.ref.xml.cpe_descriptor.descriptor.cas_processors.individual
  3.6.1.2. configurationParameterSettings Element
 
  This element provides a way to override the contained Analysis Engine's
  parameters settings. Any entry specified here must already be defined;
  values specified replace the corresponding values for each parameter. For
  Cas Processors, this mechanism is only available when they are deployed
 in “
  integrated” mode. For Collection Readers and Initializers, it always is
  available.
 
 
 
  On Thu, Jul 24, 2014 at 8:19 AM, James Baker james.d.ba...@gmail.com
  wrote:
 
  Is it possible to provide additional configuration parameters in a CPE
  descriptor XML file that aren't specified in the annotator/collection
  reader descriptor XML file?
 
  I have a collection reader that accepts the classname of a class to use
 to
  do the content extraction as a parameter. This works fine, but I'd like
 to
  be able to pass additional parameters to the content extractor via the
 XML.
  The parameters will be dependent on the content extractor though, so I
  can't specify them in the collection reader descriptor. For example,
  ContentExtractor1 might need a parameter 'encoding', and
 ContentExtractor2
  might need a parameter 'baseUrl'.
 
  I have been able to achieve this with UimaFIT by creating the collection
  reader without the XML and injecting the parameters, but when I try and
 do
  it from the XML file the parameters don't make it through to my content
  extractor (I pass the UimaContext object through to the content
 extractor).
  I suspect they might be being ignored by UIMA because they aren't in the
  descriptor. How can I work around this?
 
  Thanks,
  James
 




Re: Building UIMA-CPP on (K)Ubuntu 14.04

2014-07-22 Thread Eddie Epstein
Looking at a build on RHEL, jni.h was resolved with:
--with-jdk=/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/include
-I/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/include/linux
which follows the instructions in README.4src

I had built ICU [and all other dependencies] and installed it in a private
directory by configuring with the --prefix option. Changing uimacpp's
configure to use icu-config looks interesting.

Hope this helps,
Eddie




On Tue, Jul 22, 2014 at 2:51 AM, Egbert van der Wal e...@pointpro.nl
wrote:

 Hi,

 I've been trying to add an annotator in C++ to an Annotation Engine in
 Java.
 However, building UIMA-CPP is not a trivial task, so it seems.

 So far, I've identified dependencies on ActiveMQ-CPP, APR, a Java JRE,
 Xerces
 and ICU. Maybe there are more key dependencies, but those do not appear to be
 a
 problem on Ubuntu.

 libxerces was easy to fix as a compatible version is in the Ubuntu
 repository.

 APR is harder: Ubuntu 14.04 ships with 1.5.x while the configure script
 checks
 for 1.2, 1.3 or 1.4. Hacking the configure-script to also accept 1.5.x
 works
 but I didn't get to compiling yet, so I don't know about the API differences
 or whether this will work.

 ActiveMQ-CPP is not in the Ubuntu repository. I had to locate and build
 this
 myself, but this actually didn't prove to be so hard.

 ICU is harder. The configure script wants a --with-icu path, but then makes
 further assumptions. In Ubuntu, icu-config is located in /usr/bin while the
 header files are located in /usr/include/x86-64-linux-gnu/unicode/. The
 configure script seems to have trouble recognizing this difference. I would've
 thought
 that just having the icu-config script in the path would be sufficient, as
 icu-config spits out the rest of the required information, but this doesn't
 seem to be the case.

 Java JRE is also a problem. It wants jni.h but is not able to locate it.
 Ubuntu installs JREs in /usr/lib/jvm/<name of JVM>/ but specifying, for
 example, --with-jre=/usr/lib/jvm/java-7-oracle/ or --with-
 jre=/usr/lib/jvm/java-7-openjdk-amd64/ does not work.

 The corresponding locations on my system of jni.h are:

 /usr/lib/jvm/java-7-openjdk-amd64/include/jni.h
 /usr/lib/jvm/java-7-oracle/include/jni.h


 Is there anyone that successfully built UIMACPP 2.4 on Ubuntu 14.04? Is
 there a
 new version of UIMACPP available that works better on Ubuntu 14.04? Does
 anyone know of a PPA providing compiled UIMACPP packages for Ubuntu 14.04?

 Thanks for any assistance!

 Kind regards,

 Egbert van der Wal




Re: Is there a way to tell UIMA component to only extract some kind of entities when run opennlp.pear?

2014-06-11 Thread Eddie Epstein
Hi Jeffery,

According to the info at
http://uima.apache.org/d/uimaj-2.6.0/tutorials_and_users_guides.html#ugr.tug.aae.result_specification_setting

   The default Result Specification is taken from the Engine's output
Capability Specification.

So it should be possible to deploy the UIMA-AS service with a particular
ResultSpecification,
if a static configuration is all that is needed.

Using the ResultSpecification to control annotator behavior is quite
limited;
consider wanting a speed vs accuracy knob. A more general static solution
would be based
on configuration parameters, and a dynamic solution would put control
information into the CAS.

Eddie
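
As the exchange quoted below describes, an annotator that honors its
ResultSpecification checks it inside process(); the framework fills it in
before process() is called, defaulting to the engine's declared output
capabilities. A minimal sketch (the type name and class name are only
examples):

// Sketch of an annotator honoring its ResultSpecification; the type name is illustrative.
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;

public class ResultSpecAwareAnnotator extends JCasAnnotator_ImplBase {

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    // The framework calls setResultSpecification() before process(); by default
    // the specification mirrors the engine's declared output capabilities.
    if (!getResultSpecification().containsType("org.apache.uima.examples.PersonTitle")) {
      return; // nothing this annotator produces was requested
    }
    // ... create PersonTitle annotations here ...
  }
}

For UIMA-AS, where sendCAS() takes no ResultSpecification, trimming the
output capabilities in the deployed descriptor is the static way to influence
what such an annotator produces.
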




On Thu, Jun 5, 2014 at 5:20 PM, Jeffery yuanyun.ke...@gmail.com wrote:

 Marshall Schor msa@... writes:

 
  UIMA's descriptors include a section under the XML capabilities element
 where
  the descriptor may specify inputs and outputs.  These end up informing
 the
  ResultSpecification which is provided to the annotator.  The
 ResultSpecification
  can be queried by the annotator code to see what the annotator ought to
 produce.
 
  This is used, for example by sample annotators in the examples project:
 TutorialDateTime
 RegExAnnotator
 PersonTitleAnnotator
 
  to control what the annotators produce.
 
  This behavior, on the part of annotators, is optional - that is, an
 annotator
  might be written to ignore the ResultSpecification.
 
  So the key may be to update the annotators to take account of the
  ResultSpecification.
 
  For more background, see
  http://uima.apache.org/d/uimaj-2.6.0/tutorials_and_users_guides.html#ugr.tug.aae.result_specification_setting
 
  which discusses the ResultSpecification further.
 
  -Marshall

 Thanks, Marshall

I tried your suggestions, and it works very well.
Recently, I have been looking into UIMA-AS, and I am wondering whether we can do
  the same thing in UIMA-AS. But it seems UIMA-AS doesn't use ResultSpecification:
  the sendCAS method doesn't accept a ResultSpecification.

   String casId = asAE.sendCAS(cas);

 Thanks again for your great help, Marshall.
 -- It seems my last thank-you post somehow got lost.




Re: Sofa-unaware AEs that create new views in an AAE

2014-04-22 Thread Eddie Epstein
The current design supports passing a specific view to an annotator:
map the desired view to the default view and do not make the annotator
sofa-aware by declaring input or output sofas.

An alternate, unambiguous design would be that the default view
should always be delivered to the process method. Is this a better
model for you?

Eddie
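
To make the two component styles in this thread concrete, here is a rough
sketch. The view names and class names are illustrative only, and whether a
component is treated as sofa-aware is decided by the sofa capabilities
declared in its descriptor, not by anything in the Java code:

import org.apache.uima.analysis_component.CasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CAS;

// Style 1: a sofa-unaware annotator. Its descriptor declares no input/output
// sofas, so the aggregate's sofa mapping decides which view arrives here.
class SofaUnawareAnnotator extends CasAnnotator_ImplBase {
  @Override
  public void process(CAS mappedView) throws AnalysisEngineProcessException {
    String text = mappedView.getDocumentText();
    // create annotations directly in the mapped view
  }
}

// Style 2: a sofa-aware annotator. Its descriptor declares input/output sofas,
// it receives the base CAS, and it must select its views explicitly; the
// hard-wired names can be remapped by the aggregate's sofaMappings.
class SofaAwareAnnotator extends CasAnnotator_ImplBase {
  @Override
  public void process(CAS baseCas) throws AnalysisEngineProcessException {
    CAS input = baseCas.getView("inputView");       // illustrative name
    CAS output = baseCas.createView("outputView");  // illustrative name
    output.setDocumentText(input.getDocumentText());
  }
}

With the first style, the aggregate's sofa mappings decide which view shows up
as the default; with the second, the component must navigate views itself,
which is what the discussion below is about.
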




On Tue, Apr 22, 2014 at 6:47 AM, Peter Klügl pklu...@uni-wuerzburg.de wrote:

 Am 18.04.2014 15:23, schrieb Eddie Epstein:
  On Thu, Apr 17, 2014 at 9:17 AM, Peter Klügl pklu...@uni-wuerzburg.de
 wrote:
 
  Am 17.04.2014 15:01, schrieb Eddie Epstein:
  Hi Peter,
 
  The logic is that since a sofa aware component may have one or
  more input and/or output views, such a component needs to use
  getView to specify which to use.
 
  For sofa aware delegates, sofa mapping enables the delegate to
  hard wire input and/or output View names in annotator code (or
  annotator config parameters) and then have the real View names
  assigned via mapping in the aggregate.
  Is the real view name in the mapping important at all since the view gets
  accessed by the implementation in the process() method?
 
  The real view name is what will be used when the CAS is serialized
  to a file or to a remote service.
 


 Yes, but that has nothing to do with the mapping, which is still superfluous.


  I don't see the effect of the mapping to the default input view of a
  sofa-aware AE without input view capabilities at all. The mapping says
  view1 is linked, but another one arrives.
 
  The input view for a sofa aware component is always the base CAS view,
  for the reason given above.
 

 I don't see the reasons. Why shouldn't the analysis engine get the view
 mapped in the aggregate? If the analysis engine has several input views,
 then it gets the base view and can still access them in the
 implementation. It actually has to anyway right now. If it has only one
 (or only the default view), then it can directly use the given one
 without a static implementation (getView(name)) or an additional
 parameter. I think this would enable the creation of better components
 and I actually have a use case right now.


  So, the best practice is to introduce a parameter for specifying the
  input view, in case the AE implementation should be used several
  times in an AAE for different views?
 
  Many if not most view aware components I've seen do not have a single
  input view.


 Ruta has some, which do not work in pipelines when cascaded.

 Maybe I missed something, but I do not yet see any reasons why it should
 get the base view.

 Peter


  Eddie
 




Re: Sofa-unaware AEs that create new views in an AAE

2014-04-18 Thread Eddie Epstein
On Thu, Apr 17, 2014 at 9:17 AM, Peter Klügl pklu...@uni-wuerzburg.de wrote:

 Am 17.04.2014 15:01, schrieb Eddie Epstein:
  Hi Peter,
 
  The logic is that since a sofa aware component may have one or
  more input and/or output views, such a component needs to use
  getView to specify which to use.
 
  For sofa aware delegates, sofa mapping enables the delegate to
  hard wire input and/or output View names in annotator code (or
  annotator config parameters) and then have the real View names
  assigned via mapping in the aggregate.

 Is the real view name in the mapping important at all since the view gets
 accessed by the implementation in the process() method?


The real view name is what will be used when the CAS is serialized
to a file or to a remote service.


 I don't see the effect of the mapping to the default input view of an
 sofa aware AE without input view capabilities at all. The mapping says
 view1 is linked, but another one arrives.


The input view for a sofa aware component is always the base CAS view,
for the reason given above.



 So, the best practice is to introduce a parameter for specifying the
 input view, in case the AE implementation should be used several
 times in an AAE for different views?


 Peter


Many if not most view aware components I've seen do not have a single
input view.

Eddie


Re: problem in calling DUCC Service with ducc_submit

2014-04-01 Thread Eddie Epstein
Declaring a service dependency does not affect application code paths. The
job still needs to connect to the service in the normal way.

DUCC uses service dependencies for several reasons: to automatically start
services when needed by a job; to not give resources to a job or service
for which a dependent service is not running; and to post a warning on
running jobs when a dependent service goes bad.

Eddie


On Tue, Apr 1, 2014 at 1:27 AM, reshu.agarwal reshu.agar...@orkash.com wrote:


 Hi,

 I have run into a problem again. I have successfully deployed a DUCC UIMA AS service
 using ducc_service. The service status is available with good health. If I
 try to use this service by passing the service_dependency parameter to my job in
 ducc_submit, it does not show any error but executes only the DB
 Collection Reader, not this service.

 --
 Thanks,
 Reshu Agarwal




Re: Cas Timeout Exception in DUCC

2014-03-31 Thread Eddie Epstein
Reshu,

Please look in the logfile of the job process. Maybe 10 minutes is still
not enough?

Eddie


On Mon, Mar 31, 2014 at 2:42 AM, reshu.agarwal reshu.agar...@orkash.com wrote:

 On 03/28/2014 05:36 PM, Eddie Epstein wrote:

 There is a job specification parameter:
 --process_per_item_time_max integer
Maximum elapsed time (in minutes) for processing one CAS.

 Try setting that to something big enough.

 Eddie


 On Fri, Mar 28, 2014 at 6:18 AM, reshu.agarwal reshu.agar...@orkash.com
 wrote:

  On 03/28/2014 03:23 PM, reshu.agarwal wrote:

  Cas Timed-out on host

  Hi,

 JD.log contains this:

 Mar 28, 2014 3:07:10 PM org.apache.uima.aae.delegate.Delegate$1
 Delegate.TimerTask.run
 WARNING: Timeout While Waiting For Reply From Delegate:ducc.jd.queue.1887
 Process CAS Request Timed Out. Configured Reply Window Of 180,000. Cas
 Reference Id:-139073d5:145080818b3:-7f7c
 Mar 28, 2014 3:07:10 PM org.apache.uima.adapter.jms.
 client.ActiveMQMessageSender
 run
 INFO: UIMA AS Client Message Dispatcher Sending GetMeta Ping To the
 Service
 Mar 28, 2014 3:07:10 PM org.apache.uima.adapter.jms.client.
 BaseUIMAAsynchronousEngineCommon_impl sendCAS
 INFO: Uima AS Client Sent PING Message To Service: ducc.jd.queue.1887
 Mar 28, 2014 3:07:10 PM org.apache.uima.adapter.jms.
 client.ClientServiceDelegate
 handleError
 WARNING: Process Timeout - Uima AS Client Didn't Receive Process Reply
 Within Configured Window Of:180,000 millis
 Mar 28, 2014 3:07:10 PM org.apache.uima.adapter.jms.client.
 BaseUIMAAsynchronousEngineCommon_impl notifyOnTimout
 WARNING: Request To Process Cas Has Timed-out. Service
 Queue:ducc.jd.queue.1887. Broker: tcp://S1:61616?wireFormat.
 maxInactivityDuration=0&jms.useCompression=true&closeAsync=false Cas
 Timed-out on host: 192.168...
 Mar 28, 2014 3:07:10 PM org.apache.uima.adapter.jms.client.
 BaseUIMAAsynchronousEngineCommon_impl sendAndReceiveCAS
 INFO: UIMA AS Handling Exception in sendAndReceive(). CAS
 hashcode:1432981672. ThreadMonitor Released Semaphore For Thread ID:42
 org.apache.uima.ducc.common.jd.plugin.AbstractJdProcessExceptionHandler:
 org.apache.uima.aae.error.UimaASProcessCasTimeout: UIMA AS Client Timed
 Out Waiting for Reply From Service:ducc.jd.queue.1887
 Broker:tcp://S1:61616?
 wireFormat.maxInactivityDuration=0&jms.useCompression=true&closeAsync=false
 org.apache.uima.ducc.common.jd.plugin.AbstractJdProcessExceptionHandler:
 directive=ProcessContinue_CasNoRetry, [192.168.:5729]=1, [total]=1

 --
 Thanks,
 Reshu Agarwal


  Dear Eddie,

 If I set it to greater than 3 minutes, i.e. 10 minutes, it still showed a
 timed-out error and did not terminate the job by itself. So, it did not
 work for me.


 --process_per_item_time_max integer
   Maximum elapsed time (in minutes) for processing one CAS.


 --
 Thanks,
 Reshu Agarwal



