Arun--

I don't know what the cause of your specific technical issue is, but in my opinion, there's a better way to slice the problem.

What you're doing is taking each step in your analysis engine and running it on one or more machines. This creates two problems.

One, it's a lot of network overhead. You're moving each document across the network many times--roughly once per pipeline step. You can easily spend more time just moving the data around than actually processing it. It also puts a low ceiling on scalability, since you chew up a lot of network bandwidth.
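A back-of-the-envelope sketch of that overhead (the numbers here are mine, just for illustration): if each document crosses the network once per pipeline step, total bytes moved grow linearly with the number of steps, while the self-contained engine moves each document only once.

```java
public class NetworkCost {
    // Lower bound on bytes moved: one network hop per step per document,
    // ignoring replies, retries, and intermediate results (which only
    // make the pipelined case worse).
    static long bytesMoved(long documents, long bytesPerDoc, int hops) {
        return documents * bytesPerDoc * hops;
    }

    public static void main(String[] args) {
        long pipelined = bytesMoved(1_000_000, 100_000, 3); // 3-step pipeline
        long sliced    = bytesMoved(1_000_000, 100_000, 1); // whole engine per node
        System.out.println(pipelined / sliced); // pipelined moves 3x the data
    }
}
```

So a three-step pipeline moves at least three times the data of the slice-by-document design, before counting intermediate results at all.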

Two, in order to use your hardware efficiently, you have to get the right ratio of machines/CPUs for each step. Some steps use more cycles than others. For example, you might find that for a given configuration and set of documents, the ratio of CPU usage for steps A, B, and C is 1:5:2. Now you need to instantiate A, B, and C services to use cores in that ratio. Then, suppose you want to add more machines--how should you allocate them to A, B, and C? It will always be lumpy, with some cores not being used much. But worse, with a different configuration (different dictionaries, for example), or with different documents (longer vs. shorter, for example), the ratios will change, and you will have to reconfigure your machines again. It's never-ending, and it's never completely right.
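Here's a tiny sketch of why the allocation is always lumpy (the 1:5:2 ratio comes from the example above; everything else is made up for illustration). With 12 cores and a 1:5:2 ratio, proportional allocation rounds down to 1, 7, and 3 cores, leaving one core with no clean home:

```java
public class CoreAllocation {
    // Allocate 'cores' across services in proportion to 'ratio',
    // rounding down. Whatever doesn't divide evenly is the "lumpy"
    // remainder that sits idle or over-serves one step.
    static int[] allocate(int cores, int[] ratio) {
        int sum = 0;
        for (int r : ratio) sum += r;
        int[] alloc = new int[ratio.length];
        for (int i = 0; i < ratio.length; i++) {
            alloc[i] = cores * ratio[i] / sum;
        }
        return alloc;
    }

    public static void main(String[] args) {
        int[] alloc = allocate(12, new int[] {1, 5, 2}); // A:B:C = 1:5:2
        System.out.println(alloc[0] + " " + alloc[1] + " " + alloc[2]);
        // 1 + 7 + 3 = 11 cores assigned; one core is left over
    }
}
```

And the moment the ratio shifts (new dictionaries, longer documents), this whole calculation has to be redone.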

So, it would be much easier to manage, more efficient, and more scalable if you just run your analysis engine self-contained in a single process, and then replicate the engine over your machines/CPUs. You slice by document, not by service--send each document to a different analysis engine instance. This makes your life easier, keeps the CPUs at 100%, and scales indefinitely. Just add more machines and it goes faster.

This is what I'm doing. I use JavaSpaces (as a producer/consumer queue), but I'm sure you can get the same effect with UIMA AS and ActiveMQ.
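The slice-by-document pattern can be sketched in plain Java, with an in-process BlockingQueue standing in for JavaSpaces or ActiveMQ (in practice the queue would be a distributed space or broker, and workers would be separate processes on separate machines). `analyze` here is a hypothetical stand-in for the full self-contained engine running all of its steps in-process:

```java
import java.util.concurrent.*;

public class DocumentWorkers {
    // Hypothetical stand-in for the self-contained analysis engine:
    // runs every step (A, B, C, ...) in-process on one document,
    // so no per-step network hops and no A:B:C core ratio to tune.
    static String analyze(String doc) {
        return doc.toUpperCase(); // placeholder for real analysis
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // One engine replica per core; each consumes whole documents
        // from the shared queue until it sees the shutdown marker.
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        String doc = queue.take();
                        if (doc.equals("POISON")) break;
                        System.out.println(analyze(doc));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        for (String doc : new String[] {"alpha", "beta", "gamma"}) queue.put(doc);
        for (int i = 0; i < workers; i++) queue.put("POISON"); // one per worker
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

To scale, you add machines and point more workers at the same queue; nothing else needs to be rebalanced.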


Greg
