[
https://issues.apache.org/jira/browse/CTAKES-374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Selina Chu updated CTAKES-374:
------------------------------
Summary: Scaleout of cTAKES pipeline (was: Scale out of cTAKES pipeline)
> Scaleout of cTAKES pipeline
> ---------------------------
>
> Key: CTAKES-374
> URL: https://issues.apache.org/jira/browse/CTAKES-374
> Project: cTAKES
> Issue Type: New Feature
> Affects Versions: future enhancement
> Reporter: Selina Chu
> Fix For: 3.2.1
>
>
> Currently, cTAKES can't easily be deployed in an asynchronous manner. UIMA
> components aren't serializable, and therefore neither are cTAKES'
> components. We would like to find better ways to run cTAKES in a
> distributed fashion.
> For example, cTAKES takes a long time to process a long document (e.g. 10+
> pages).
> I would like to see a feature that partitions the input to cTAKES in a way
> that doesn't affect cTAKES annotation performance, so that the partitions
> can be processed on a cluster running in distributed mode (e.g. cTAKES on
> Spark Streaming), and then recombines the results so that the word/phrase
> token positions are sequentially ordered.
> We have a simple implementation of the ClinicalPipelineFactory with Spark
> Streaming. Our initial attempt partitions the input by paragraph.
> For example, we are doing something like:
> RDD.map(lambda paragraph: process_in_ctakes(paragraph))
> I would also like to know whether there are better ways of doing this.
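
The partition-and-recombine idea described above could be sketched roughly as
follows. This is only an illustration under stated assumptions, not the
reporter's implementation and not the cTAKES API: `split_paragraphs`,
`annotate_partition`, `annotate_document` and `whitespace_tokens` are
hypothetical names, a whitespace tokenizer stands in for the cTAKES pipeline,
and the built-in `map` stands in for an RDD map. The point it shows is that each
paragraph carries its start offset, so spans computed relative to a paragraph
can be shifted back to document coordinates and sorted.

```python
import re
from typing import Callable, List, Tuple

# (begin, end, covered_text) in character offsets
Span = Tuple[int, int, str]
# (start offset of the paragraph in the document, paragraph text)
Partition = Tuple[int, str]


def split_paragraphs(document: str) -> List[Partition]:
    """Split on blank lines, remembering where each paragraph starts."""
    partitions: List[Partition] = []
    start = 0
    for separator in re.finditer(r"\n\s*\n", document):
        if separator.start() > start:
            partitions.append((start, document[start:separator.start()]))
        start = separator.end()
    if start < len(document):
        partitions.append((start, document[start:]))
    return partitions


def whitespace_tokens(text: str) -> List[Span]:
    """Hypothetical stand-in for the cTAKES pipeline: one span per token,
    with offsets relative to the text it was given."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]


def annotate_partition(partition: Partition,
                       annotate: Callable[[str], List[Span]]) -> List[Span]:
    """Annotate one paragraph and shift its spans to document offsets."""
    offset, text = partition
    return [(begin + offset, end + offset, covered)
            for begin, end, covered in annotate(text)]


def annotate_document(document: str,
                      annotate: Callable[[str], List[Span]]) -> List[Span]:
    """Partition, annotate each partition independently, then recombine."""
    partitions = split_paragraphs(document)
    # Plain map() stands in for the distributed step. On Spark, results may
    # come back in any order, which is why the spans are sorted afterwards.
    per_partition = map(lambda p: annotate_partition(p, annotate), partitions)
    spans = [span for part in per_partition for span in part]
    return sorted(spans, key=lambda span: (span[0], span[1]))
```

Because every span is expressed in document offsets after the shift, a check
such as `document[begin:end] == covered` holds for each recombined span. Whether
paragraph boundaries are safe cut points for annotators that depend on wider
context is a separate question that this sketch does not address.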