[ https://issues.apache.org/jira/browse/CTAKES-189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pei Chen updated CTAKES-189:
----------------------------
Labels: gsoc gsoc2013 (was: gsoc gsoc2013,)
> GSoC: Implement OCR/Tika to standardize text input for cTAKES
> -------------------------------------------------------------
>
> Key: CTAKES-189
> URL: https://issues.apache.org/jira/browse/CTAKES-189
> Project: cTAKES
> Issue Type: New Feature
> Affects Versions: 3.0-incubating
> Reporter: Pei Chen
> Labels: gsoc, gsoc2013
> Fix For: 3.2
>
>
> I propose adding a component to cTAKES that accepts various types of content
> (PDF, scanned JPGs, Word, XLS, TXT, etc.) and extracts the text before
> passing it on to cTAKES for NLP processing. Open source libraries such as
> Apache Tika and JavaOCR could serve as a starting point, but I have not found
> a single library that covers all of the above, including OCR, in one
> easy-to-integrate flow.
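
One way to picture the proposed component is a small front end that picks a text extractor by file type and hands plain text on to the NLP pipeline. The sketch below is only an illustration under assumptions: `TextInputRouter`, `TextExtractor`, `register`, and `extractText` are hypothetical names, not part of cTAKES, Tika, or JavaOCR, and only plain-text input is wired up. A Tika-backed or OCR-backed extractor would be registered the same way.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical front end that routes raw input to a format-specific text extractor. */
final class TextInputRouter {

    /**
     * Hypothetical extension point: turns raw bytes into plain text.
     * A real extractor (for example one delegating to Tika or an OCR engine)
     * would wrap its checked exceptions in an unchecked one.
     */
    interface TextExtractor {
        String extract(byte[] content);
    }

    private final Map<String, TextExtractor> extractorsByExtension = new HashMap<>();

    /** Associates a file extension (case-insensitive, without the dot) with an extractor. */
    void register(String extension, TextExtractor extractor) {
        extractorsByExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Chooses an extractor from the file name's extension and returns the extracted text. */
    String extractText(String fileName, byte[] content) {
        int dot = fileName.lastIndexOf('.');
        String extension = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        TextExtractor extractor = extractorsByExtension.get(extension);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for: " + fileName);
        }
        return extractor.extract(content);
    }

    /** Builds a router that only understands UTF-8 plain text; other formats are left to plug-ins. */
    static TextInputRouter withPlainTextSupport() {
        TextInputRouter router = new TextInputRouter();
        router.register("txt", bytes -> new String(bytes, StandardCharsets.UTF_8));
        return router;
    }

    public static void main(String[] args) {
        TextInputRouter router = withPlainTextSupport();
        byte[] note = "Patient denies chest pain.".getBytes(StandardCharsets.UTF_8);
        // The extracted string is what would be handed to the NLP pipeline.
        System.out.println(router.extractText("note.TXT", note));
    }
}
```

Dispatching on file extension keeps the sketch short; content-based type detection would be more robust for scanned or mislabeled files.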
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira