Another thanks to all contributing to this thread.

We're looking to transition a large NLP application processing ~30TB/month
from a custom NLP framework to UIMA-AS, and from parallel processing on a
dedicated cluster with custom Python scripts that call GNU parallel, to
something with better support for managing resources on a shared cluster.

Both our internal IT/engineering group and our cluster vendor
(HortonWorks) use and support Hadoop/Spark/YARN on a new shared cluster.
DUCC's capabilities seem to overlap with these more general purpose tools.
Although it may be more closely aligned with UIMA for a dedicated
cluster, I think the big question for us would be how/whether it would
play nicely with other Hadoop/Spark/YARN jobs on the shared cluster.
We're also likely to move at least some of our workload to a cloud
computing host, and it seems like Hadoop/Spark are much more likely to be
supported there.

David Fox

On 9/15/17, 1:57 PM, "Eddie Epstein" <eaepst...@gmail.com> wrote:

>There are a few DUCC features that might be of particular interest for
>scaling out UIMA analytics.
>
> - all user code for batch processing continues to use the existing UIMA
>component model: collection readers, CAS multipliers, analysis engines, and
>CAS consumers.**
>
> - DUCC supports assembling and debugging a single-threaded process with
>these components, and then, with no code changes, launching a highly
>scaled-out deployment.
>
> - for applications that use too much RAM to be able to utilize all the
>cores on worker machines, DUCC can do the vertical (thread) scaleout
>needed to share memory.
>
> - DUCC automatically captures the performance breakdown of the UIMA-based
>processes, as well as capturing process statistics including CPU, RAM,
>swap, pagefaults and GC. Performance breakdown info for individual tasks
>(DUCC work items) can optionally be captured.
>
> - DUCC has extensive error handling, automatically resubmitting work
>associated with uncaught exceptions, process crashes, machine failures,
>network failures, etc.
>
> - Exceptions are convenient to get to, and an attempt is made to make
>obvious things that might be tricky to find, such as all the reasons a
>process might fail to start, without having to dig through DUCC framework
>logs.
>
>** DUCC services introduce a new user programmable component, a service
>pinger, that is responsible for validating that a service is operating
>correctly. The service pinger can also dynamically change the number of
>instances of a service, and it can restart individual instances that are
>determined to be acting badly.
>
>Eddie
>
>On Fri, Sep 15, 2017 at 10:32 AM, Osborne, John D <josbo...@uabmc.edu>
>wrote:
>
>> Thanks Richard and Nicholas,
>>
>> Nicholas - have you looked at SUIM (https://github.com/oaqa/suim) ?
>>
>> It's also doing UIMA on Spark - I'm wondering if you are aware of it and
>> how it is different from your own project?
>>
>> Thanks for any info,
>>
>>  -John
>>
>>
>> ________________________________________
>> From: Richard Eckart de Castilho [r...@apache.org]
>> Sent: Friday, September 15, 2017 5:29 AM
>> To: user@uima.apache.org
>> Subject: Re: UIMA analysis from a database
>>
>> On 15.09.2017, at 09:28, Nicolas Paris <nipari...@gmail.com> wrote:
>> >
>> > - UIMA-AS is another way to program UIMA
>>
>> Here you probably meant uimaFIT.
>>
>> > - UIMA-FIT is complicated
>> > - UIMA-FIT only work with UIMA
>>
>> ... and I suppose you mean UIMA-AS here.
>>
>> > - UIMA only focuses on text Annotation
>>
>> Yep. Although it has also been used for other media, e.g. video and
>> audio.
>> But the core UIMA framework doesn't specifically consider these media.
>> People who apply UIMA in the context of other media do so with custom
>> type systems.
>>
>> > - UIMA is not good at:
>> >       - text transformation
>>
>> It is not straightforward, but it is possible. E.g. the text normalizers in
>> DKPro Core make use of either different views for different states of
>> normalization or drop the original text and forward the normalized
>> text within a pipeline by means of a CAS multiplier.
>>
>> >       - read data from source in parallel
>> >       - write data to folder in parallel
>>
>> Not sure if these two are limitations of the framework
>> rather than of the way that you use readers and writers
>> in the particular scale-out mode you are working with.
>>
>> >       - machine learning interface
>>
>> UIMA doesn't offer ML as part of the core framework because
>> that is simply not within the scope of what the UIMA framework
>> aims to achieve.
>>
>> There are various people who have built ML around UIMA, e.g.
>> ClearTK (http://cleartk.github.io/cleartk/) or DKPro TC
>> (https://dkpro.github.io/dkpro-tc/) - and as you did, it
>> can be combined in various ways with frameworks that
>> specialize in ML.
>>
>>
>> Cheers,
>>
>> -- Richard
>>
>>
>>

