Re: Synchronizing Batches AE and StatusCallbackListener

2017-04-25 Thread Erik Fäßler
Thanks for all the input! I have some reading to do now ;-)

Best,

Erik


Re: Synchronizing Batches AE and StatusCallbackListener

2017-04-21 Thread Eddie Epstein
Hi Erik,

A few words about DUCC and your application. DUCC is a cluster controller
that includes a resource manager and 3 applications: batch processing, long
running services and singleton processes.

The batch processing application consists of a user's CollectionReader, which
defines work items, and a user's aggregate for processing work items that can
be replicated as desired across the cluster of machines. DUCC manages the
remote process scale out and distribution of work items. The aggregate can
be vertically scaled within each process so that in-heap data can be shared
by multiple instances of the aggregate. UIMA-AS is not required for this
simple threading model.

For most applications a work item is itself a collection, a CAS containing
references to the data to be processed, where the collection size is
designed to have small enough granularity to support scale out but big
enough granularity to avoid bottlenecks.

The user's aggregate normally has an initial CasMultiplier that reads the
input data and creates the CASes to be fed to the rest of the pipeline.
When all child CASes have finished processing, the work item CAS is routed
to the aggregate's CasConsumer to finalize the collection. DUCC considers
the work item complete only when the work item CAS is successfully
processed.
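
The "work item is complete only when all its children are done" behavior can be pictured with a small plain-Java sketch. This is not the DUCC or UIMA API; the class and method names are made up for illustration:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: a work item is finalized only after every
// child CAS spawned from it has finished processing.
class WorkItem {
    private final String id;
    private final AtomicInteger pendingChildren;
    private volatile boolean finalized = false;

    WorkItem(String id, int childCount) {
        this.id = id;
        this.pendingChildren = new AtomicInteger(childCount);
    }

    // Called whenever one child CAS completes; the last one
    // stands in for "route the work-item CAS to the CasConsumer".
    void childFinished() {
        if (pendingChildren.decrementAndGet() == 0) {
            finalized = true;
        }
    }

    boolean isFinalized() { return finalized; }
}

public class WorkItemDemo {
    public static void main(String[] args) {
        WorkItem item = new WorkItem("work-item-1", 3);
        item.childFinished();
        item.childFinished();
        System.out.println(item.isFinalized()); // false: one child still pending
        item.childFinished();
        System.out.println(item.isFinalized()); // true: all children done
    }
}
```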

The system is quite robust to errors: uncaught exceptions, analytics
crashing, machines crashing, etc.

Regards,
Eddie



Re: Synchronizing Batches AE and StatusCallbackListener

2017-04-21 Thread Olga Patterson
Erik,

My team at the VA has developed an easy way of implementing UIMA AS pipelines 
and scaling them to a large number of nodes, using the Leo framework, which 
extends UIMA AS 2.8.1. We have run pipelines on over 200M documents scaled 
across multiple nodes with dozens of service instances, and it performs great.

Here is some info:
http://department-of-veterans-affairs.github.io/Leo/

The documentation for Leo reflects an earlier version of Leo. If you are 
interested in using it with Java 8 and UIMA 2.8.1: we have not yet released the 
latest version on the VA GitHub, but we can share it with you so that you can 
test it out and possibly provide your comments back to us.

Leo has simple-to-use functionality for flexible batch read and write and it 
can work with any UIMA AEs and existing descriptor files and type system 
descriptions, so if you already have a pipeline, wrapping it with Leo would 
take just a few lines of code.

Let me know if you are interested and I can help you to get started.

Olga Patterson

Re: Synchronizing Batches AE and StatusCallbackListener

2017-04-21 Thread Jaroslaw Cwiklik
Erik, thanks. It is clearer now what you are trying to accomplish. First,
there are no plans to retire the CPE. It is supported and I don't know of
any plans to retire it. The only caveat is ongoing development: my efforts
are focused on extending and improving UIMA-AS.

I don't have an answer yet for how to handle the CPE crash scenario with
respect to batching and a subsequent restart from the last known good batch.
It seems some coordination would be needed to avoid redoing the whole
collection after a crash. It's been a while since I've looked at the CPE.
I will take a look and see what is possible, if anything.
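
The coordination described here, restarting from the last known good batch instead of redoing the whole collection, can be sketched in a few lines of plain Java. This is not CPE functionality; the class and file layout are purely illustrative assumptions:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch: persist the ID of the last durably completed batch,
// so a restarted job can skip everything up to and including it.
public class BatchCheckpoint {
    private final Path file;

    public BatchCheckpoint(Path file) { this.file = file; }

    // Record a batch as durably completed. Call this only after the
    // consumer's batch write has actually committed to the database.
    public void markCompleted(long batchId) throws IOException {
        Files.writeString(file, Long.toString(batchId));
    }

    // On restart: the first batch that still needs processing.
    public long firstPendingBatch() throws IOException {
        if (!Files.exists(file)) return 0L;
        String s = Files.readString(file).trim();
        return s.isEmpty() ? 0L : Long.parseLong(s) + 1;
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("cpe-checkpoint", ".txt");
        BatchCheckpoint cp = new BatchCheckpoint(p);
        cp.markCompleted(41);
        System.out.println(cp.firstPendingBatch()); // 42
    }
}
```

A real implementation would have to write the checkpoint atomically with (or after) the batch commit, which is exactly the synchronization problem this thread is about.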

There is another Apache UIMA project called DUCC, which stands for
Distributed UIMA Cluster Computing. From your email it looks like you have
a cluster of machines available. Here is a quick description of DUCC:

DUCC is a Linux cluster controller designed to scale out any UIMA pipeline
for high-throughput collection processing jobs as well as for low-latency
real-time applications. Building on UIMA-AS, DUCC is particularly well
suited to run large-memory Java analytics in multiple threads in order to
fully utilize multicore machines. DUCC manages the life cycle of all
processes deployed across the cluster, including non-UIMA processes such as
Tomcat servers or VNC sessions.

You can find more info on this here:
https://uima.apache.org/doc-uimaducc-whatitam.html

In UIMA-AS, batching is an application concern. I am a bit fuzzy on the
implementation, so perhaps someone else can comment on how to implement
batching and how to handle errors. You can use a CasMultiplier and a custom
FlowController to manage CASes and react to errors. The UIMA-AS service can
take an input CAS representing your batch, pass it on to the CasMultiplier,
generate CASes for each piece of work, and deliver results to the
CasConsumer, with a FlowController in the middle orchestrating the flow. I
defer to application deployment experts to provide you with more detail.
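
The routing idea above, batch CAS in, per-document CASes out, consumer reached only on full success, can be simulated in plain Java. This is not the UIMA-AS FlowController API; every name below is an illustrative stand-in:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the flow: a batch "CAS" is split into
// per-document items (CasMultiplier stand-in), each item is processed,
// and the batch only reaches the consumer step if no item failed
// (FlowController stand-in reacting to errors).
public class BatchFlowDemo {

    // CasMultiplier stand-in: one child "CAS" per document in the batch.
    static List<String> multiply(String batchCas, int docCount) {
        List<String> children = new ArrayList<>();
        for (int i = 0; i < docCount; i++) {
            children.add(batchCas + "/doc-" + i);
        }
        return children;
    }

    // FlowController stand-in: route children through processing and
    // route the batch CAS onward only on full success.
    static String route(String batchCas, int docCount, int failingDoc) {
        for (String child : multiply(batchCas, docCount)) {
            boolean ok = !child.endsWith("/doc-" + failingDoc);
            if (!ok) {
                return "batch-failed";   // react to the error, e.g. retry later
            }
        }
        return "batch-delivered";        // all children done: consumer finalizes
    }

    public static void main(String[] args) {
        System.out.println(route("batch-7", 3, -1)); // batch-delivered
        System.out.println(route("batch-8", 3, 1));  // batch-failed
    }
}
```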

Jerry

Re: Synchronizing Batches AE and StatusCallbackListener

2017-04-21 Thread Erik Fäßler
Hi Jerry,

thanks a lot for your answer! I’m sorry that I didn’t make myself clearer. I 
will try again! :-)
Here comes a lot of text, sorry for that. The post actually has two parts: the 
first explains my issue, the second responds to the pointer to UIMA-AS.

First: Yes, I use a CPE. I process text documents. Tens of millions of them.
So, I have the following components to my issue, running with the CPE.

1. A CAS-Consumer (just an AnalysisEngine internally, of course). This consumer 
is responsible for serialising the document CAS into XMI and sending the XMI to 
a database; it is an XMI-to-database consumer. For performance reasons, the XMI 
of multiple CASes is buffered and then sent as a batch, let's say 50 CAS XMIs 
at a time.
2. A CPE StatusCallbackListener which also writes to the same database, but in 
another table. It logs into the database which documents have been successfully 
processed by the CPE. It also works on a batch basis.

The goal: The CallbackListener should only mark those documents as successfully 
processed (i.e. as “finished”) for which the CAS-Consumer has actually sent the 
XMI data to the database.

Reason: I don’t want documents marked as “finished” whose XMI data is not yet 
in the database but still in the CAS buffer. If the pipeline crashes at that 
point, the XMI data never gets sent to the database. The processing state is 
then inconsistent: documents that have not been written into the database are 
marked as successfully processed, but their data is missing.

Also, not every XMI is stored: there is a condition in the consumer that 
decides whether an XMI is to be stored or not. Thus, I cannot “create 
consistency” by checking which XMIs made it into the database.

Is this easier to understand?
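
The kind of agreement described here, "mark finished only what has really been flushed", can be sketched in plain Java: the consumer records which document IDs its batch flush has actually committed, and the listener consults that record. This is not UIMA API; all names below are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of consumer/listener coordination: the XMI consumer
// records which document IDs it has actually flushed to the database, and
// the status listener marks as "finished" only documents in that set.
public class BatchSync {
    // IDs whose XMI batch has been committed to the database.
    private final Set<String> flushed = ConcurrentHashMap.newKeySet();
    private final List<String> pendingXmi = new ArrayList<>();

    // Consumer side: buffer XMIs, then flush them as one batch.
    void bufferXmi(String docId) { pendingXmi.add(docId); }

    void flushBatch() {
        // ... send the buffered XMIs to the database as one batch here ...
        flushed.addAll(pendingXmi);  // record only after the send succeeded
        pendingXmi.clear();
    }

    // Listener side: mark "finished" only when the XMI is really stored,
    // or when the consumer's condition intentionally skipped it.
    boolean canMarkFinished(String docId, boolean xmiSkippedByCondition) {
        return xmiSkippedByCondition || flushed.contains(docId);
    }

    public static void main(String[] args) {
        BatchSync sync = new BatchSync();
        sync.bufferXmi("doc-1");
        System.out.println(sync.canMarkFinished("doc-1", false)); // false: still buffered
        sync.flushBatch();
        System.out.println(sync.canMarkFinished("doc-1", false)); // true: committed
    }
}
```

The listener would then hold back its own "finished" batch until every document in it passes this check, so both tables move forward in agreement.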



Regarding UIMA-AS:

I tried it out a few years back when it was rather new, UIMA 2.3.1 or 
something. Back then, the process went like this:
1. Install a broker (or something - ActiveMQ was it called?)
2. Configure it.
3. Start it.
4. For each AE you want to use, deploy the AE on some server in your cluster 
(multiple AEs can be bundled into an AAE).
5. Start a reader process that will then fill the broker queue.
6. Wait until processing is finished.
7. Stop all the AE services deployed to the cluster, if you want to save the 
resources.
8. Stop the broker.

This was quite a while back, so perhaps this is not exactly how it was. But it 
seemed overly complex to me. I had to log in to each server where I wanted 
work to be done. We have around 20 nodes. Perhaps I could write a script for 
that, but then I would have to keep track of the servers that are free to use 
at any given time, because I am not the only one using the cluster.
And then I have to stop all the AE “services”; until then, they will use memory 
because they just idle when there is nothing more to do.

In contrast, CPEs are self-contained projects in my case, which I can 
distribute easily through our job system (SLURM).

I thought all the setup for UIMA-AS would pay off in better performance. But in 
my - admittedly limited - tests there was not much of a performance difference. 
CPEs seemed to be a bit faster due to the lack of CAS serialization between 
reader and AEs.

Of course, this was years in the past. Is the process a bit simpler today? Or 
perhaps I got it wrong to begin with; that's possible. But I read the 
documentation back then and couldn't see how to do things much more simply.

BUT if CPEs can’t solve my issue and UIMA-AS can, then perhaps I will try it 
again.

Another question: You said “CPE was replaced by UIMA-AS”. Does that mean that 
CPEs will eventually be removed from UIMA? Are they still a part of UIMA 3?

Sorry for all the text!

Best regards and thanks!

Erik


Re: Synchronizing Batches AE and StatusCallbackListener

2017-04-20 Thread Jaroslaw Cwiklik
Hi Erik, sorry for the delay responding to your question. This seems like a
CPE question, is this right? I am not quite following what issue you
are running into. Could you explain this better? With a clearer problem
description, perhaps others will jump in with an answer :)

Just FYI, the CPE was replaced by UIMA-AS quite a long time ago.
Perhaps UIMA-AS can work better for you. You can read about it here:
https://uima.apache.org/d/uima-as-2.9.0/uima_async_scaleout.html

Jerry
UIMA Team



Synchronizing Batches AE and StatusCallbackListener

2017-04-18 Thread Erik Fäßler
Hi all,

I have a use case where a consumer of mine sends CAS XMI data to a database 
in batchProcessComplete(). I also use a StatusCallbackListener that logs to the 
database whether a document has completed processing; this is also done 
batch-wise.
Now the issue is: if the pipeline crashes for any reason, I must start over, 
because the “completion” flag from the CallbackListener and the data actually 
sent by the XMI consumer are not synchronised, i.e. I don’t know whether the 
data has actually been sent for a document that has completed processing, 
because everything is done batch-wise and not immediately, for performance 
reasons. I also cannot simply look in the database to see which XMI data is 
there, because it only gets sent when a condition is met.

I would like to somehow coordinate the consumer and the CallbackListener so 
that they send their data for the same documents in agreement. Is there 
anything I can do to achieve this?

Best,

Erik