Re: Scale out using multiple Collection Readers and Cas Consumers

Greg Holmberg Wed, 01 Dec 2010 22:10:27 -0800

Hi Eddie--

My experiences with UIMA AS are mostly with applications deployed
on a single cluster of multi-core machines interconnected with a high
performance network.

By "high performance" do you mean something faster than gigabit Ethernet, like InfiniBand or 10 Gb optical fiber?

The largest cluster we have worked with is several
hundred nodes. We see hundreds of MB/sec of data flowing between
clients and services thru a single broker. The load is evenly distributed
among all instances of a service type. Client requests are processed
in the order they are queued.

I'm having trouble picturing this system landscape--could you describe how the various pieces of data (content, control messages, status messages, etc.) move through the system, from document source (or app) to result database? I'd like to see where the network I/O is and where the disk I/O is, and what data formats are used.


By "broker" do you mean Active MQ?

How do clients submit requests to the cluster? Do you support non-Java clients? What does a request contain? Can the client monitor the progress of a request?

Is the broker a bottleneck? Does all content pass through it? How many times does each document (in one form or another) pass through the broker?


How does a web crawler fit into the system?

Does one request have to completely finish before another can start? Are there priorities? What about requests from an interactive application, where the user is waiting?

Given that document processing time varies significantly, and different requests may use different aggregate engines, how do you manage to keep all the CPUs equally (and hopefully fully) busy?

How does a client get the annotators that it needs deployed into the cluster?

Is every machine performing the same function, or do they specialize in a particular annotator? That is, is an aggregate engine self-contained in a single JVM, or is it split over multiple machines?


If a machine crashes, can there be data loss?  How do you recover?

Can you increase or decrease the capacity of the system without disrupting service?

So many questions, I know. But I think these are legitimate issues when building a system, and I don't see how AS handles them. Someone really needs to write a paper...

The strength of UIMA AS is to easily scale out pipelines that
exceed the processing resources of individual nodes with no changes to
annotator and flow controller code or descriptors. Achieving high
CPU utilization may require a bit of sophistication, as always, but
UIMA AS includes the tools to facilitate that process.

Really? To me AS seems more like a box of Legos and a picture (but no instructions) of a really cool airplane you can build if you've got the time and expertise.

Sorry about that. I'm just having a really hard time seeing how to build a reliable, scalable, efficient document processing service on AS. It seems more theoretical than practical.



Greg Holmberg
