Selina Chu created CTAKES-374:
---------------------------------
Summary: Scale out of cTAKES pipeline. Finding better ways to
allow cTAKES to be easily run in a distributed fashion.
Key: CTAKES-374
URL: https://issues.apache.org/jira/browse/CTAKES-374
Project: cTAKES
Issue Type: New Feature
Affects Versions: future enhancement
Reporter: Selina Chu
Fix For: 3.2.1
Currently, cTAKES cannot easily be deployed asynchronously. UIMA components
are not serializable, and therefore neither are the cTAKES components built on
them. We would like to find better ways to run cTAKES in a distributed
fashion.
For example, cTAKES takes a long time to process a long document (e.g. 10+
pages).
I would like to see a feature that partitions the input to cTAKES in a way
that does not affect cTAKES annotation performance, so that the partitions can
be processed on a cluster running in distributed mode (e.g. Spark Streaming
with cTAKES). The results would then be recombined so that the word/phrase
token positions are in sequential order.
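
The partition-and-recombine step described above can be illustrated with a
minimal sketch. It assumes annotations are (begin, end, text) character spans
and that paragraphs are separated by blank lines. The names split_paragraphs,
toy_annotate and annotate_partitioned are hypothetical, and toy_annotate is
only a stand-in for a real cTAKES run, not part of cTAKES.

```python
import re
from typing import Callable, List, Tuple

# An annotation is (begin, end, covered_text), with character offsets.
Annotation = Tuple[int, int, str]


def split_paragraphs(document: str) -> List[Tuple[int, str]]:
    """Split on blank lines, keeping each paragraph's start offset."""
    chunks = []
    pos = 0
    for sep in re.finditer(r"\n\s*\n", document):
        if sep.start() > pos:
            chunks.append((pos, document[pos:sep.start()]))
        pos = sep.end()
    if pos < len(document):
        chunks.append((pos, document[pos:]))
    return chunks


def toy_annotate(text: str) -> List[Annotation]:
    """Hypothetical stand-in for cTAKES: one annotation per word,
    with offsets relative to the text it was given."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\w+", text)]


def annotate_partitioned(
    document: str,
    annotate: Callable[[str], List[Annotation]] = toy_annotate,
) -> List[Annotation]:
    """Annotate each paragraph independently, then shift the offsets back
    into document coordinates and sort them into sequential order."""
    merged = []
    for offset, paragraph in split_paragraphs(document):
        for begin, end, covered in annotate(paragraph):
            merged.append((begin + offset, end + offset, covered))
    merged.sort(key=lambda a: (a[0], a[1]))
    return merged
```

Keeping the start offset with each paragraph is what lets the per-paragraph
results be merged in any arrival order. Annotators that rely on context
outside the paragraph (for example, section headers or coreference) may still
produce different results when the input is split this way.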
We have a simple implementation of the ClinicalPipelineFactory with Spark
Streaming. Our initial attempt partitions the input by paragraph. For
example, we are doing something like:
    RDD.map(a_single_paragraph => a_single_paragraph.process_in_ctakes())
I would also like to know whether there are better ways of doing this.
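
One possible variation, given that the pipeline components are not
serializable, is to construct the pipeline on the worker once per partition
instead of shipping it from the driver. The sketch below shows only the shape
of such a function. ToyPipeline and process_partition are hypothetical names,
and ToyPipeline is a stand-in for an expensive pipeline object, not a cTAKES
or UIMA class.

```python
from typing import Iterable, Iterator, List, Tuple


class ToyPipeline:
    """Hypothetical stand-in for an expensive, non-serializable pipeline."""

    instances_created = 0  # counts constructions, for illustration only

    def __init__(self) -> None:
        ToyPipeline.instances_created += 1

    def process(self, paragraph: str) -> List[str]:
        # Placeholder for real annotation work: split into tokens.
        return paragraph.split()


def process_partition(
    paragraphs: Iterable[Tuple[int, str]],
) -> Iterator[Tuple[int, List[str]]]:
    """Build the pipeline once, then reuse it for every paragraph in the
    partition. Each input is (paragraph_index, paragraph_text); the index
    is passed through so results can be reordered afterwards."""
    pipeline = ToyPipeline()
    for index, paragraph in paragraphs:
        yield index, pipeline.process(paragraph)
```

In PySpark, a function of this shape could be passed to RDD.mapPartitions, so
that only the function, and not the pipeline object, is sent to the workers.
The Java and Scala RDD APIs offer mapPartitions as well. Whether this performs
better than a per-paragraph map for cTAKES has not been measured here.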
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)