[ https://issues.apache.org/jira/browse/CTAKES-189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pei Chen updated CTAKES-189:
----------------------------
Fix Version/s: future enhancement (was: 3.2.0)
> GSoC: Implement OCR/Tika to standardize text input for cTAKES
> -------------------------------------------------------------
>
> Key: CTAKES-189
> URL: https://issues.apache.org/jira/browse/CTAKES-189
> Project: cTAKES
> Issue Type: New Feature
> Affects Versions: 3.0-incubating
> Reporter: Pei Chen
> Labels: gsoc, gsoc2013
> Fix For: future enhancement
>
> Attachments: Gui.java
>
>
> I am proposing a component in cTAKES that can take in various types of
> content (PDF, scanned JPGs, Word, XLS, TXT, etc.) and extract the text
> content before passing it on to cTAKES for NLP processing.
> Open source libraries such as Apache Tika and JavaOCR exist as starting
> points, but I have not found a single library that incorporates all of
> the above, including OCR, into one flow.
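One possible shape for such a component, as a minimal sketch only: a front end picks an extractor by file extension and returns plain text for the NLP pipeline. The names TextExtractor and InputNormalizer are hypothetical and are not part of cTAKES, Tika, or JavaOCR. Only the plain-text path is implemented here, using the JDK alone; extractors for PDF, Word, XLS, or scanned images would be registered the same way and would delegate to whichever library is chosen.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical contract: turn one input file into plain text. */
interface TextExtractor {
    String extract(Path input) throws IOException;
}

/** Hypothetical front end that selects an extractor by file extension. */
final class InputNormalizer {
    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    /** Registers an extractor for an extension such as "txt" (case-insensitive). */
    void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Returns the lower-cased extension, or "" if the file name has none. */
    static String extensionOf(Path input) {
        String name = input.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Returns the extracted text, or empty if no extractor handles this type. */
    Optional<String> toPlainText(Path input) throws IOException {
        TextExtractor extractor = byExtension.get(extensionOf(input));
        return extractor == null
                ? Optional.empty()
                : Optional.of(extractor.extract(input));
    }

    /** Assumption: only TXT is handled out of the box; other types are added by callers. */
    static InputNormalizer withDefaults() {
        InputNormalizer normalizer = new InputNormalizer();
        normalizer.register("txt",
                path -> new String(Files.readAllBytes(path), StandardCharsets.UTF_8));
        return normalizer;
    }
}
```

Returning an empty Optional for unsupported types, rather than throwing, lets a caller skip or log such files and keep processing the rest of a batch.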
--
This message was sent by Atlassian JIRA
(v6.2#6252)