RE: [External Sender] Re: Runtime Parameters to Annotators Running as Services

2018-06-01 Thread Osborne, John D
Thanks - when you say having the client put the data in the CAS, do you mean:

1) Putting it in via the CollectionReader which the client is instantiating
2) Some other mechanism of putting data into the CAS that I am not aware of

I had been using 1), but in the process of refactoring my CollectionReader I 
was trying to slim it down and just have it pass document identifiers to the 
aggregate analysis engine. I'm fuzzy on whether 2) is an option and, if so, 
how to implement it.
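
Something like the following is what I have in mind for 2) - just a sketch, 
assuming a custom type (say org.example.AnalysisMetadata with a String 
feature "databaseId") declared in the service's type system:

    // Client-side sketch: populate the CAS directly, no CollectionReader involved.
    // Uses org.apache.uima.cas.* plus the UIMA-AS client API; uimaAsEngine is an
    // initialized org.apache.uima.aae.client.UimaAsynchronousEngine.
    CAS cas = uimaAsEngine.getCAS();
    cas.setDocumentText(documentText);
    Type metaType = cas.getTypeSystem().getType("org.example.AnalysisMetadata");
    Feature dbId = metaType.getFeatureByBaseName("databaseId");
    FeatureStructure meta = cas.createFS(metaType);
    meta.setStringValue(dbId, runtimeDatabaseId);  // value only known at runtime
    cas.addFsToIndexes(meta);
    uimaAsEngine.sendCAS(cas);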

 -John



From: Eddie Epstein [eaepst...@gmail.com]
Sent: Thursday, May 31, 2018 4:25 PM
To: user@uima.apache.org
Subject: [External Sender] Re: Runtime Parameters to Annotators Running as 
Services

I may not understand the scenario.

For meta-data that would modify the behavior of the analysis, for example
changing what analysis is run for a CAS, putting it into the CAS itself is
definitely recommended.
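
As a rough sketch, assuming a custom metadata type like the 
org.example.AnalysisMetadata above, each annotator can read the value back 
out in its process() method:

    // Annotator-side sketch (inside a JCasAnnotator_ImplBase subclass;
    // uses org.apache.uima.cas.Type/Feature/FeatureStructure/FSIterator).
    public void process(JCas jcas) throws AnalysisEngineProcessException {
      Type metaType = jcas.getTypeSystem().getType("org.example.AnalysisMetadata");
      Feature dbId = metaType.getFeatureByBaseName("databaseId");
      FSIterator<FeatureStructure> it =
          jcas.getFSIndexRepository().getAllIndexedFS(metaType);
      if (it.hasNext()) {
        String databaseId = it.next().getStringValue(dbId);
        // ... use databaseId to parameterize the analysis
      }
    }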

The example above is for the UIMA service to access the artifact itself
from a remote source (presumably because it is even less efficient for the
remote client to put the data into the CAS). That is certainly recommended
for high scale-out of analysis services, assuming that the remote source
can handle the load without becoming a worse bottleneck than just having
the client put the data into the CAS.

Regards,
Eddie

On Tue, May 29, 2018 at 1:33 PM, Osborne, John D  wrote:

> What is the best practice for passing runtime meta-data about the analysis
> to individual annotators when running UIMA-AS or UIMA-DUCC services? An
> example would be a database identifier for an analysis of many documents.
> I can't pass this in as parameters to the aggregate analysis engine running
> as a service, because I don't know what that identifier is until runtime
> (when the application calls the service).
>
> I used to put such information in the JCas, having the CollectionReader
> implementation do all this work. But I am striving to have a more
> lightweight CollectionReader... The application can obviously write
> metadata to a database or other shared resource, but then it becomes
> incumbent on the AnalysisEngine to access that shared resource over the
> network (slow).
>
> Any advice appreciated,
>
>  -John
>


Runtime Parameters to Annotators Running as Services

2018-05-29 Thread Osborne, John D
What is the best practice for passing runtime meta-data about the analysis to 
individual annotators when running UIMA-AS or UIMA-DUCC services? An example 
would be a database identifier for an analysis of many documents. I can't pass 
this in as parameters to the aggregate analysis engine running as a service, 
because I don't know what that identifier is until runtime (when the 
application calls the service).

I used to put such information in the JCas, having the CollectionReader 
implementation do all this work. But I am striving to have a more lightweight 
CollectionReader... The application can obviously write metadata to a database 
or other shared resource, but then it becomes incumbent on the AnalysisEngine 
to access that shared resource over the network (slow).

Any advice appreciated,

 -John


Deploy Async Service without XML

2018-05-14 Thread Osborne, John D
Is it possible to deploy a UIMA-AS service without an XML descriptor, similar 
to how UIMA-FIT works? I currently deploy services using deployAsyncService.sh

I have multiple long-running services that need to work in different 
(production, testing, dev) environments and would prefer to avoid having an XML 
file for each service. I realize that with some refactoring (like removing 
environment-specific parameters) the number of XML files could be reduced, but 
I've become spoiled with UIMA-FIT. :)

I'm looking at the toXML() function so I can potentially generate the 
aggregate analysis engine with UIMA-FIT.
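
For example, something like this sketch (Segmenter and Tagger are 
placeholders for real annotator classes, and deployAsyncService.sh would 
still need a UIMA-AS deployment descriptor pointing at the generated AE 
descriptor):

    import java.io.FileWriter;
    import java.io.Writer;
    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import org.apache.uima.fit.factory.AnalysisEngineFactory;

    public class WriteDescriptor {
      public static void main(String[] args) throws Exception {
        // Build the aggregate programmatically with uimaFIT...
        AnalysisEngineDescription aggregate =
            AnalysisEngineFactory.createEngineDescription(
                AnalysisEngineFactory.createEngineDescription(Segmenter.class),
                AnalysisEngineFactory.createEngineDescription(Tagger.class));
        // ...then serialize it to the descriptor XML the deployment expects.
        try (Writer out = new FileWriter("MyAggregate.xml")) {
          aggregate.toXML(out);
        }
      }
    }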

-John



RE: UIMA analysis from a database

2017-09-15 Thread Osborne, John D
Thanks Richard and Nicholas,

Nicholas - have you looked at SUIM (https://github.com/oaqa/suim) ?

It's also doing UIMA on Spark - I'm wondering if you are aware of it and how it 
is different from your own project?

Thanks for any info,

 -John



From: Richard Eckart de Castilho [r...@apache.org]
Sent: Friday, September 15, 2017 5:29 AM
To: user@uima.apache.org
Subject: Re: UIMA analysis from a database

On 15.09.2017, at 09:28, Nicolas Paris  wrote:
>
> - UIMA-AS is another way to program UIMA

Here you probably meant uimaFIT.

> - UIMA-FIT is complicated
> - UIMA-FIT only work with UIMA

... and I suppose you mean UIMA-AS here.

> - UIMA only focuses on text Annotation

Yep. Although it has also been used for other media, e.g. video and audio.
But the core UIMA framework doesn't specifically consider these media.
People who apply UIMA in the context of other media do so with custom
type systems.

> - UIMA is not good at:
>   - text transformation

It is not straightforward, but it is possible. E.g. the text normalizers in
DKPro Core either use different views for different states of normalization,
or drop the original text and forward the normalized text within a pipeline
by means of a CAS multiplier.
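
A minimal sketch of the view-based variant (the view name and normalize()
are placeholders):

    // Keep the original text in the initial view; put the normalized text
    // into a second view for downstream annotators that want it.
    CAS normalizedView = cas.createView("normalized");
    normalizedView.setDocumentText(normalize(cas.getDocumentText()));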

>   - read data from source in parallel
>   - write data to folder in parallel

Not sure if these two are limitations of the framework
rather than of the way that you use readers and writers
in the particular scale-out mode you are working with.

>   - machine learning interface

UIMA doesn't offer ML as part of the core framework because
that is simply not within the scope of what the UIMA framework
aims to achieve.

There are various people who have built ML around UIMA, e.g.
ClearTK (http://cleartk.github.io/cleartk/) or DKPro TC
(https://dkpro.github.io/dkpro-tc/) - and, as you did, UIMA can be
combined in various ways with ML frameworks that specialize in ML.


Cheers,

-- Richard




RE: UIMA analysis from a database

2017-09-14 Thread Osborne, John D
Hi Nicolas,

I'm curious, why did you decide to use Spark over UIMA-AS or DUCC? Is it 
because you are more familiar with Spark or were there other reasons?

I have been using UIMA-AS, I am currently experimenting with DUCC and would 
love to hear your thoughts on the matter.

 -John



From: Nicolas Paris [nipari...@gmail.com]
Sent: Thursday, September 14, 2017 5:32 PM
To: user@uima.apache.org
Subject: Re: UIMA analysis from a database

Hi Benedict

Not sure this is helpful for you; it is just some advice.
I recommend using UIMA for what it was first intended for: NLP pipelines.

When dealing with multi-threaded applications, I would go for dedicated
technologies.

I have been successfully using UIMA together with Apache Spark. This design
works well on a single computer, and I am now able to distribute a UIMA
pipeline over dozens of machines without extra work.

Then I focus on the UIMA pipeline doing its job well and, after testing,
industrialize it over Spark.

Advantages of this design are:
- benefit from Spark's distribution expertise (node failure, memory
  consumption, data partitioning...)
- simplify UIMA programming (no multithreading inside, only NLP stuff)
- scale when needed (add more cheap computers, get better performance)
- get expertise with Spark, and use it with any Java code you'd like
- Spark has JDBC connectors and can easily fetch data in multiple threads.

You can find a working example in my repo:
https://github.com/parisni/UimaOnSpark
It was not simple to get working, but I can tell you now that this method
is robust and optimized.
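
As a rough sketch, the shape of it is (assuming uimaFIT and Spark's Java
API; spark is a SparkSession, and the JDBC options and MyAnnotator are
placeholders):

    // Read rows in parallel partitions via Spark's JDBC source,
    // then run one UIMA engine instance per partition.
    // Uses org.apache.spark.sql.*, org.apache.spark.api.java.function
    // .ForeachPartitionFunction, and org.apache.uima.fit.factory.*.
    Dataset<Row> docs = spark.read().format("jdbc")
        .option("url", jdbcUrl)
        .option("dbtable", "articles")
        .option("partitionColumn", "id")
        .option("lowerBound", "1")
        .option("upperBound", "100000")
        .option("numPartitions", "16")
        .load();

    docs.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
      // One AnalysisEngine per partition: no multithreading inside UIMA itself.
      AnalysisEngine engine = AnalysisEngineFactory.createEngine(MyAnnotator.class);
      JCas jcas = engine.newJCas();
      while (rows.hasNext()) {
        jcas.reset();
        jcas.setDocumentText(rows.next().getAs("text"));
        engine.process(jcas);
        // ... collect or write results here
      }
    });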


On 14 Sept 2017 at 21:24, Benedict Holland wrote:
> Hello everyone,
>
> I am trying to get my project off the ground and hit a small problem.
>
> I want to read text from a large database (let's say 100,000+ rows). Each
> row will have a text article. I want to connect to the database, request a
> single row from the database, and process this document through an NLP
> engine and I want to do this in parallel. Each document will be say, split
> up into sentences and each sentence will be POS tagged.
>
> After reading the documentation, I am more confused than when I started. I
> think I want something like the FileSystemCollectionReader example and
> build a CPE. Instead of reading from the file system, it will read from the
> database.
>
> There are two problems with this approach:
>
> 1. I am not sure it is multi-threaded: CAS initializers are deprecated and
> it appears that the getNext() method will only run in a single thread.
> 2. The FileSystemCollectionReader loads references to the file location
> into memory but not the text itself.
>
> For problem 1, the line I find very troubling is
>
> File file = (File) mFiles.get(mCurrentIndex++);
>
> I have to assume from this line that the CollectionReader_ImplBase is not
> multi-threaded but is intended to rapidly iterate over a set of documents
> in a single thread.
>
> Problem 2 is easily solved as I can create a massive array of integers if I
> feel like it.
>
> Anyway, after deciding that this is not likely the solution, I looked into
> Multi-view Sofa annotators. I don't think these do what I want either. In
> this context, I would treat the database table as a single object with many
> "views" being chunks of rows. I don't think this works, based on the
> SofaExampleAnnotator code provided. It also appears to run in a single
> thread.
>
> This leaves me with CAS pools. I know that this is going to be
> multi-threaded. I believe I create however many CAS objects from the
> annotator I want, probably an aggregate annotator. Is this correct and am I
> on the right track with CAS Pools?
>
> Thank you so much,
> ~Ben