On Thu, Nov 20, 2014 at 10:01 PM, D. Heinze <dhei...@gnoetics.com> wrote:
> Eddie... thanks. Yes, that sounds like I would not have the advantage of
> DUCC managing the UIMA pipeline.

Depends on the definition of "managing". DUCC manages the lifecycle of
analytic pipelines running both as job processes and as services; the two
differ in how DUCC decides how many instances of each to run. And you are
right that only for jobs will DUCC send work items to the analytic
pipeline.

> To break it down a little for the uninitiated (me),
>
> 1. how do I start a DUCC job that stays resident because it has high
> startup cost (e.g. 2 minutes to load all the resources for the UIMA
> pipeline VS about 2 seconds to process each request)?

Run the pipeline as a service. A service can be configured to start
automatically, as soon as DUCC itself starts. If the load on the service
increases, DUCC can be told [manually or programmatically] to launch
additional service instances. (See the registration sketch at the end of
this message.)

> 2. once I have a resident job, how do I get the Job Driver to iteratively
> feed references to each next document (as they are received) to the
> resident Job Process? Because all the input jobs will be archived anyhow,
> I'm okay with passing them through the file system if needed.

The easiest approach is to have an application driver, say a web service,
feed input directly to the service; a client sketch is also below. If you
use document references as the input, the same analytic pipeline can serve
both live processing (as a service) and batch processing (as a job).

DUCC jobs are designed for batch work, where the size of the input
collection is known up front and the job process is replicated as widely
as available resources, and the job's fair share when multiple jobs are
running, will allow. DUCC services are intended to support job pipelines,
for example a large-memory but low-latency analytic that can be shared by
many job process instances, or to back interactive applications.

Have you looked at creating a UIMA-AS service from a UIMA pipeline? A
deployment-descriptor sketch for that is below as well.
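First, the service registration. A DUCC service is described by a small
properties file and registered with the ducc_services CLI. This is only a
sketch from memory, not a tested recipe: the file name, descriptor path,
queue and memory values, and the exact property names should all be
checked against the DuccBook for your DUCC release.

    # myservice.properties -- hypothetical registration for a resident pipeline
    description         = Resident analytic pipeline
    process_DD          = /path/to/MyPipelineDD.xml
    classpath           = /path/to/pipeline/lib/*
    process_memory_size = 8
    scheduling_class    = fixed
    instances           = 1
    autostart           = true

    # Register it; with autostart=true DUCC brings the service up when DUCC starts.
    ducc_services --register myservice.properties

If load grows later, additional instances can be requested through the
same CLI (e.g. via its modify operation); again, check the CLI reference
for the exact form in your release.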
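Second, the application driver. A thin UIMA-AS client can push each
document, or a reference to it, to the service's input queue as it
arrives. Below is a minimal synchronous sketch in Java; the broker URL,
queue name, and the convention of passing a file path as the document text
are assumptions for illustration, not part of your setup.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.uima.aae.client.UimaAsynchronousEngine;
    import org.apache.uima.adapter.jms.client.BaseUIMAAsynchronousEngine_impl;
    import org.apache.uima.cas.CAS;

    public class DriverSketch {
      public static void main(String[] args) throws Exception {
        UimaAsynchronousEngine engine = new BaseUIMAAsynchronousEngine_impl();

        // Point the client at the service's broker and input queue.
        Map<String, Object> ctx = new HashMap<String, Object>();
        ctx.put(UimaAsynchronousEngine.ServerUri, "tcp://localhost:61616"); // broker (placeholder)
        ctx.put(UimaAsynchronousEngine.ENDPOINT, "MyPipelineQueue");        // service queue (placeholder)
        ctx.put(UimaAsynchronousEngine.CasPoolSize, 2);
        engine.initialize(ctx);

        // One request: hand the service a reference to an archived file and
        // let the pipeline's first annotator dereference it.
        CAS cas = engine.getCAS();
        cas.setDocumentText("/archive/incoming/doc-00001.txt");
        engine.sendAndReceiveCAS(cas); // blocks until the service replies
        cas.release();

        engine.stop();
      }
    }

The same send/receive loop, fed from a servlet or a directory watcher,
becomes the "feed each next document as it is received" driver you
described.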
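Finally, wrapping the existing pipeline as a UIMA-AS service needs no Java
changes: a deployment descriptor points at the pipeline's top-level
analysis engine descriptor and names the input queue. A bare-bones sketch,
with placeholder names and paths:

    <analysisEngineDeploymentDescription
        xmlns="http://uima.apache.org/resourceSpecifier">
      <name>MyPipelineService</name>
      <description>UIMA pipeline deployed as a UIMA-AS service</description>
      <deployment protocol="jms" provider="activemq">
        <casPool numberOfCASes="2"/>
        <service>
          <inputQueue endpoint="MyPipelineQueue"
                      brokerURL="tcp://localhost:61616" prefetch="0"/>
          <topDescriptor>
            <import location="MyPipelineAE.xml"/>
          </topDescriptor>
        </service>
      </deployment>
    </analysisEngineDeploymentDescription>

This is the descriptor a process_DD property would point at, and it is
also where scale-out within a single service process is configured.

Eddie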