Hi Eddie--


My experiences with UIMA AS are mostly with applications deployed
on a single cluster of multi-core machines interconnected with a high
performance network.

By "high performance" do you mean something more than gigabit Ethernet, like InfiniBand or 10 Gb optical fiber?

The largest cluster we have worked with is several
hundred nodes. We see hundreds of MB/sec of data flowing between
clients and services thru a single broker. The load is evenly distributed
among all instances of a service type. Client requests are processed
in the order they are queued.

I'm having trouble picturing this system landscape--could you describe how the various pieces of data (content, control messages, status messages, etc.) move through the system, from document source (or app) to result database? I'd like to see where the network I/O is and where the disk I/O is, and what data formats are used.

By "broker" do you mean ActiveMQ?

How do clients submit requests to the cluster? Do you support non-Java clients? What does a request contain? Can the client monitor the progress of a request?

Is the broker a bottleneck? Does all content pass through it? How many times does each document (in one form or another) pass through the broker?

How does a web crawler fit into the system?

Does one request have to completely finish before another can start? Are there priorities? What about requests from interactive applications, where the user is waiting?

Given that document processing time varies significantly, and different requests may use different aggregate engines, how do you manage to keep all the CPUs equally (and hopefully fully) busy?

How does a client get the annotators that it needs deployed into the cluster?

Is every machine performing the same function, or do they specialize in a particular annotator? That is, is an aggregate engine self-contained in a single JVM, or is it split over multiple machines?

If a machine crashes, can there be data loss? How do you recover?

Can you increase or decrease the capacity of the system without disrupting service?

So many questions, I know. But I think these are legitimate issues when building a system, and I don't see how AS handles them. Someone really needs to write a paper...

The strength of UIMA AS is to easily scale out pipelines that
exceed the processing resources of individual nodes with no changes to
annotator and flow controller code or descriptors. Achieving high
CPU utilization may require a bit of sophistication, as always, but
UIMA AS includes the tools to facilitate that process.

Really? To me AS seems more like a box of Legos and a picture (but no instructions) of a really cool airplane you can build if you've got the time and expertise.

Sorry about that. I'm just having a really hard time seeing how to build a reliable, scalable, efficient document processing service on AS. It seems more theoretical than practical.


Greg Holmberg
