RE: [External Sender] Re: Runtime Parameters to Annotators Running as Services
Thanks - when you say having the client put the data in the CAS, do you mean:

1) Putting it in the CollectionReader which the client is instantiating
2) Some other mechanism of putting data into the CAS that I am not aware of

I had been using 1), but in the process of refactoring my CollectionReader I was trying to slim it down and just have it pass document identifiers to the aggregate analysis engine. I'm fuzzy on whether 2) is an option and, if so, how to implement it.

-John

From: Eddie Epstein [eaepst...@gmail.com]
Sent: Thursday, May 31, 2018 4:25 PM
To: user@uima.apache.org
Subject: [External Sender] Re: Runtime Parameters to Annotators Running as Services

I may not understand the scenario. For meta-data that would modify the behavior of the analysis, for example changing what analysis is run for a CAS, putting it into the CAS itself is definitely recommended. The example above is for the UIMA service to access the artifact itself from a remote source (presumably because it is even less efficient for the remote client to put the data into the CAS). That is certainly recommended for high scale-out of analysis services, assuming that the remote source can handle the load and not become a worse bottleneck than just having the client put the data into the CAS.

Regards,
Eddie

On Tue, May 29, 2018 at 1:33 PM, Osborne, John D wrote:
> What is the best practice for passing runtime meta-data about the analysis
> to individual annotators when running UIMA-AS or UIMA-DUCC services? An
> example would be a database identifier for an analysis of many documents.
> I can't pass this in as parameters to the aggregate analysis engine running
> as a service, because I don't know what that identifier is until runtime
> (when the application calls the service).
>
> I used to put such information in the JCas, having the CollectionReader
> implementation do all this work. But I am striving to have a more
> lightweight CollectionReader... The application can obviously write
> metadata to a database or other shared resource, but then it becomes
> incumbent on the AnalysisEngine to access that shared resource over the
> network (slow).
>
> Any advice appreciated,
>
> -John
Runtime Parameters to Annotators Running as Services
What is the best practice for passing runtime meta-data about the analysis to individual annotators when running UIMA-AS or UIMA-DUCC services? An example would be a database identifier for an analysis of many documents. I can't pass this in as parameters to the aggregate analysis engine running as a service, because I don't know what that identifier is until runtime (when the application calls the service).

I used to put such information in the JCas, having the CollectionReader implementation do all this work. But I am striving to have a more lightweight CollectionReader... The application can obviously write metadata to a database or other shared resource, but then it becomes incumbent on the AnalysisEngine to access that shared resource over the network (slow).

Any advice appreciated,

-John
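[Editor's note] The in-CAS approach discussed in this thread can be sketched as follows: define a small metadata type in the pipeline's type system and have whoever creates the CAS (CollectionReader or client) add one instance per document; annotators read it back in process(). This is only a sketch under assumptions: `AnalysisMetadata`, its `databaseId` feature, and `DbAwareAnnotator` are hypothetical names, with the type assumed to be generated by JCasGen from a type system descriptor; none of them are UIMA built-ins.

```java
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

// AnalysisMetadata is a hypothetical JCas type (generated by JCasGen)
// with a single String feature 'databaseId'.
public class DbAwareAnnotator extends JCasAnnotator_ImplBase {

    // Client/reader side: attach the runtime identifier to each CAS
    // before it is sent to the service.
    public static void attachMetadata(JCas jcas, String databaseId) {
        AnalysisMetadata meta = new AnalysisMetadata(jcas);
        meta.setDatabaseId(databaseId);
        meta.addToIndexes();
    }

    // Annotator side: recover the identifier from the CAS at process time,
    // so no service-level parameter is needed.
    @Override
    public void process(JCas jcas) {
        AnalysisMetadata meta = JCasUtil.selectSingle(jcas, AnalysisMetadata.class);
        String databaseId = meta.getDatabaseId();
        // ... use databaseId to parameterize the analysis for this document ...
    }
}
```

Because the identifier travels inside the CAS, it scales out with the CAS itself and requires no shared-resource lookup on the service side.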
Deploy Async Service without XML
Is it possible to deploy a UIMA-AS service without an XML descriptor, similar to how UIMA-FIT works? I currently deploy services using deployAsyncService.sh. I have multiple long-running services that need to work in different environments (production, testing, dev) and would prefer to avoid having an XML file for each service. I realize that with some refactoring (like removing environment-specific parameters) this number of XML files could be reduced, but I've become spoiled with UIMA-FIT. :)

I'm looking at the toXML() function so I can potentially generate the aggregate analysis engine with UIMA-FIT.

-John
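[Editor's note] The toXML() route mentioned above can be sketched like this: build the aggregate description programmatically with uimaFIT, pull environment-specific values from the environment at startup, and serialize the result so the UIMA-AS deployment descriptor (still required by deployAsyncService.sh) can point at the generated file. `MyAnnotator`, the `dbUrl` parameter, and the file names are illustrative placeholders, not part of any shipped API.

```java
import java.io.FileOutputStream;

import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.AnalysisEngineFactory;

public class GenerateAggregateDescriptor {
    public static void main(String[] args) throws Exception {
        // Build the delegate and aggregate descriptions in code;
        // environment-specific values come from the environment rather
        // than from per-environment XML files.
        AnalysisEngineDescription delegate = AnalysisEngineFactory.createEngineDescription(
                MyAnnotator.class,
                "dbUrl", System.getenv("DB_URL"));
        AnalysisEngineDescription aggregate =
                AnalysisEngineFactory.createEngineDescription(delegate);

        // Serialize the aggregate so the UIMA-AS deployment descriptor
        // can reference it by location.
        try (FileOutputStream os = new FileOutputStream("aggregate.xml")) {
            aggregate.toXML(os);
        }
    }
}
```

This collapses N per-environment descriptors into one generator run per deployment; only the (small) deployment descriptor itself still has to exist as XML.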
RE: UIMA analysis from a database
Thanks Richard and Nicolas,

Nicolas - have you looked at SUIM (https://github.com/oaqa/suim)? It's also doing UIMA on Spark - I'm wondering if you are aware of it and how it is different from your own project?

Thanks for any info,

-John

From: Richard Eckart de Castilho [r...@apache.org]
Sent: Friday, September 15, 2017 5:29 AM
To: user@uima.apache.org
Subject: Re: UIMA analysis from a database

On 15.09.2017, at 09:28, Nicolas Paris wrote:
>
> - UIMA-AS is another way to program UIMA

Here you probably meant uimaFIT.

> - UIMA-FIT is complicated
> - UIMA-FIT only work with UIMA

... and I suppose you mean UIMA-AS here.

> - UIMA only focuses on text Annotation

Yep. Although it has also been used for other media, e.g. video and audio. But the core UIMA framework doesn't specifically consider these media. People who apply UIMA in the context of other media do so with custom type systems.

> - UIMA is not good at:
>   - text transformation

It is not straight-forward but possible. E.g. the text normalizers in DKPro Core make use of either different views for different states of normalization or drop the original text and forward the normalized text within a pipeline by means of a CAS multiplier.

>   - read data from source in parallel
>   - write data to folder in parallel

Not sure if these two are limitations of the framework rather than of the way that you use readers and writers in the particular scale-out mode you are working with.

>   - machine learning interface

UIMA doesn't offer ML as part of the core framework because that is simply not within the scope of what the UIMA framework aims to achieve. There are various people who have built ML around UIMA, e.g. ClearTK (http://cleartk.github.io/cleartk/) or DKPro TC (https://dkpro.github.io/dkpro-tc/) - and, as you did, it can be combined in various ways with ML frameworks that specialize specifically on ML.

Cheers,

-- Richard
RE: UIMA analysis from a database
Hi Nicolas,

I'm curious, why did you decide to use Spark over UIMA-AS or DUCC? Is it because you are more familiar with Spark or were there other reasons? I have been using UIMA-AS, I am currently experimenting with DUCC and would love to hear your thoughts on the matter.

-John

From: Nicolas Paris [nipari...@gmail.com]
Sent: Thursday, September 14, 2017 5:32 PM
To: user@uima.apache.org
Subject: Re: UIMA analysis from a database

Hi Benedict,

Not sure this is helpful for you, but only an advice. I recommend using UIMA for what it is first intended: NLP pipelines. When dealing with multi-threaded applications, I would go for dedicated technologies. I have been successfully using UIMA together with Apache Spark. While this design works well on a single computer, I am now able to distribute a UIMA pipeline over dozens of them, without extra work. Then I focus on the UIMA pipeline doing its job well, and after testing, industrialize it over Spark.

Advantages of this design are:
- benefit from Spark's distribution expertise (node failure, memory consumption, data partitioning...)
- simplify UIMA programming (no multithreading inside, only NLP stuff)
- scale when needed (add more cheap computers, get better performance)
- get expertise with Spark, and use it with any Java code you'd like
- Spark has JDBC connectors and may be able to fetch data in multiple threads easily

You can find a working example in my repo: https://github.com/parisni/UimaOnSpark

This has not been simple to get working, but I can tell now this method is robust and optimized.

On 14 Sept 2017 at 21:24, Benedict Holland wrote:
> Hello everyone,
>
> I am trying to get my project off the ground and hit a small problem.
>
> I want to read text from a large database (let's say 100,000+ rows). Each
> row will have a text article. I want to connect to the database, request a
> single row from the database, and process this document through an NLP
> engine, and I want to do this in parallel. Each document will be, say, split
> up into sentences and each sentence will be POS tagged.
>
> After reading the documentation, I am more confused than when I started. I
> think I want something like the FileSystemCollectionReader example and
> build a CPE. Instead of reading from the file system, it will read from the
> database.
>
> There are two problems with this approach:
>
> 1. I am not sure it is multi-threaded: CAS initializers are deprecated and
> it appears that the getNext() method will only run in a single thread.
> 2. The FileSystemCollectionReader loads references to the file locations
> into memory but not the text itself.
>
> For problem 1, the line I find very troubling is
>
> File file = (File) mFiles.get(mCurrentIndex++);
>
> I have to assume from this line that the CollectionReader_ImplBase is not
> multi-threaded but is intended to rapidly iterate over a set of documents
> in a single thread.
>
> Problem 2 is easily solved as I can create a massive array of integers if I
> feel like it.
>
> Anyway, after deciding that this is not likely the solution, I looked into
> multi-view Sofa annotators. I don't think these do what I want either. In
> this context, I would treat the database table as a single object with many
> "views" being chunks of rows. I don't think this works, based on the
> SofaExampleAnnotator code provided. It also appears to run in a single
> thread.
>
> This leaves me with CAS pools. I know that this is going to be
> multi-threaded. I believe I create however many CAS objects from the
> annotator I want, probably an aggregate annotator. Is this correct and am I
> on the right track with CAS Pools?
>
> Thank you so much,
> ~Ben
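[Editor's note] On the CAS-pool question at the end of the thread: the mechanic behind a CAS pool is simply a bounded blocking pool that worker threads check CASes out of and return them to after processing. The following is a dependency-free sketch of that checkout/check-in pattern only; the `Cas` class here is a stand-in for a real UIMA CAS, not the UIMA API, and the analysis step is elided.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class CasPoolSketch {
    // Stand-in for a real CAS; holds the document to process.
    static class Cas {
        String documentText;
        void reset() { documentText = null; } // analogous to CAS.reset()
    }

    // Bounded pool: checkout blocks when all CASes are in use,
    // which naturally throttles the producers.
    static class CasPool {
        private final BlockingQueue<Cas> pool;
        CasPool(int size) {
            pool = new ArrayBlockingQueue<>(size);
            for (int i = 0; i < size; i++) pool.add(new Cas());
        }
        Cas checkout() throws InterruptedException { return pool.take(); }
        void checkin(Cas cas) { cas.reset(); pool.add(cas); }
        int available() { return pool.size(); }
    }

    public static void main(String[] args) throws Exception {
        final CasPool casPool = new CasPool(4);
        // Each worker: fetch a row, check out a CAS, process, check in.
        Runnable worker = () -> {
            try {
                Cas cas = casPool.checkout();
                cas.documentText = "row text"; // would come from the database
                // ... run the analysis engine on this CAS here ...
                casPool.checkin(cas);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(worker);
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        // All CASes are back in the pool once every worker has finished.
        System.out.println("pool size after all workers: " + casPool.available());
    }
}
```

Eight workers share four CASes without ever holding more than four in flight; this is the throttling behavior a real CAS pool gives a multi-threaded CPE or service.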