Another thanks to all contributing to this thread. We're looking to transition a large NLP application processing ~30TB/month from a custom NLP framework to UIMA-AS, and from parallel processing on a dedicated cluster with custom Python scripts that call GNU parallel, to something with better support for managing resources on a shared cluster.
Both our internal IT/engineering group and our cluster vendor (Hortonworks) use and support Hadoop/Spark/YARN on a new shared cluster. DUCC's capabilities seem to overlap with these more general-purpose tools. Although it may be more closely aligned with UIMA for a dedicated cluster, I think the big question for us would be how/whether it would play nicely with other Hadoop/Spark/YARN jobs on the shared cluster. We're also likely to move at least some of our workload to a cloud computing host, and it seems like Hadoop/Spark are much more likely to be supported there.

David Fox

On 9/15/17, 1:57 PM, "Eddie Epstein" <eaepst...@gmail.com> wrote:

> There are a few DUCC features that might be of particular interest for
> scaling out UIMA analytics.
>
> - All user code for batch processing continues to use the existing UIMA
> component model: collection readers, CAS multipliers, analysis engines,
> and CAS consumers.**
>
> - DUCC supports assembling and debugging a single-threaded process with
> these components, and then, with no code change, launching a highly
> scaled-out deployment.
>
> - For applications that use too much RAM to be able to utilize all the
> cores on worker machines, DUCC can do the vertical (thread) scale-out
> needed to share memory.
>
> - DUCC automatically captures the performance breakdown of the UIMA-based
> processes, as well as process statistics including CPU, RAM, swap, page
> faults, and GC. Performance breakdown info for individual tasks (DUCC
> work items) can optionally be captured.
>
> - DUCC has extensive error handling, automatically resubmitting work
> associated with uncaught exceptions, process crashes, machine failures,
> network failures, etc.
>
> - Exceptions are convenient to get to, and an attempt is made to make
> obvious things that might be tricky to find, such as all the reasons a
> process might fail to start, without having to dig through DUCC framework
> logs.
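[Editor's note: the vertical (thread) scale-out point above — one copy of a large in-memory resource shared by many analysis threads, instead of one copy per process — can be sketched in plain Python. This is only an illustration of the memory trade-off; the "model" and "annotate" names are made up and are not DUCC or UIMA API.]

```python
import threading
import queue

# Stand-in for a large in-memory resource (e.g. a parsing model).
# With thread-level scale-out it is loaded once and shared by all
# workers; with process-level scale-out each worker would hold its
# own copy, multiplying the RAM footprint.
MODEL = {"suffix": "!"}  # hypothetical "model"

def annotate(doc, model):
    # Hypothetical analysis step: tag the document using the shared model.
    return doc + model["suffix"]

def worker(in_q, out_q, model):
    while True:
        doc = in_q.get()
        if doc is None:          # poison pill: shut this thread down
            break
        out_q.put(annotate(doc, model))

in_q, out_q = queue.Queue(), queue.Queue()
# Vertical scale-out: four analysis threads over one shared MODEL.
threads = [threading.Thread(target=worker, args=(in_q, out_q, MODEL))
           for _ in range(4)]
for t in threads:
    t.start()
docs = ["a", "b", "c", "d", "e", "f"]
for doc in docs:
    in_q.put(doc)
for _ in threads:
    in_q.put(None)
for t in threads:
    t.join()
results = sorted(out_q.get() for _ in range(len(docs)))
print(results)  # ['a!', 'b!', 'c!', 'd!', 'e!', 'f!']
```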
> ** DUCC services introduce a new user-programmable component, a service
> pinger, that is responsible for validating that a service is operating
> correctly. The service pinger can also dynamically change the number of
> instances of a service, and it can restart individual instances that are
> determined to be acting badly.
>
> Eddie
>
> On Fri, Sep 15, 2017 at 10:32 AM, Osborne, John D <josbo...@uabmc.edu>
> wrote:
>
>> Thanks Richard and Nicholas,
>>
>> Nicholas - have you looked at SUIM (https://github.com/oaqa/suim)?
>>
>> It's also doing UIMA on Spark - I'm wondering if you are aware of it and
>> how it is different from your own project?
>>
>> Thanks for any info,
>>
>> -John
>>
>> ________________________________________
>> From: Richard Eckart de Castilho [r...@apache.org]
>> Sent: Friday, September 15, 2017 5:29 AM
>> To: user@uima.apache.org
>> Subject: Re: UIMA analysis from a database
>>
>> On 15.09.2017, at 09:28, Nicolas Paris <nipari...@gmail.com> wrote:
>> >
>> > - UIMA-AS is another way to program UIMA
>>
>> Here you probably meant uimaFIT.
>>
>> > - UIMA-FIT is complicated
>> > - UIMA-FIT only works with UIMA
>>
>> ... and I suppose you mean UIMA-AS here.
>>
>> > - UIMA only focuses on text annotation
>>
>> Yep. Although it has also been used for other media, e.g. video and
>> audio. But the core UIMA framework doesn't specifically consider these
>> media. People who apply UIMA in the context of other media do so with
>> custom type systems.
>>
>> > - UIMA is not good at:
>> >   - text transformation
>>
>> It is not straightforward but possible. E.g., the text normalizers in
>> DKPro Core make use of either different views for different states of
>> normalization, or drop the original text and forward the normalized
>> text within a pipeline by means of a CAS multiplier.
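[Editor's note: a language-agnostic sketch of the CAS-multiplier pattern Richard describes — consume one incoming document and emit a new one carrying the normalized text, leaving the original untouched. The `normalize` step and dict-based "CAS" are toy stand-ins, not UIMA API.]

```python
def normalize(text):
    # Toy normalization step: collapse whitespace and lowercase.
    return " ".join(text.split()).lower()

def multiplier(cas_stream):
    """Multiplier-style component: for each incoming 'CAS' (here just a
    dict), emit a fresh one whose text is the normalized form, so the
    original document is preserved for components that still need it."""
    for cas in cas_stream:
        yield {"id": cas["id"], "text": normalize(cas["text"])}

docs = [{"id": 1, "text": "  Hello   WORLD "}]
out = list(multiplier(iter(docs)))
print(out)               # normalized copies flow downstream
print(docs[0]["text"])   # original text is unchanged
```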
>>
>> > - read data from source in parallel
>> > - write data to folder in parallel
>>
>> Not sure if these two are limitations of the framework
>> rather than of the way that you use readers and writers
>> in the particular scale-out mode you are working with.
>>
>> > - machine learning interface
>>
>> UIMA doesn't offer ML as part of the core framework because
>> that is simply not within the scope of what the UIMA framework
>> aims to achieve.
>>
>> There are various people who have built ML around UIMA, e.g.
>> ClearTK (http://cleartk.github.io/cleartk/) or DKPro TC
>> (https://dkpro.github.io/dkpro-tc/) - and, as you did, UIMA
>> can be combined in various ways with frameworks that
>> specialize specifically in ML.
>>
>> Cheers,
>>
>> -- Richard