[ 
https://issues.apache.org/jira/browse/CTAKES-189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pei Chen updated CTAKES-189:
----------------------------

    Labels: gsoc gsoc2013  (was: gsoc gsoc2013,)
    
> GSoC: Implement OCR/Tika to standardize text input for cTAKES
> -------------------------------------------------------------
>
>                 Key: CTAKES-189
>                 URL: https://issues.apache.org/jira/browse/CTAKES-189
>             Project: cTAKES
>          Issue Type: New Feature
>    Affects Versions: 3.0-incubating
>            Reporter: Pei Chen
>              Labels: gsoc, gsoc2013
>             Fix For: 3.2
>
>
> I propose adding a component to cTAKES that accepts various types of content
> (PDF, scanned JPGs, Word, XLS, TXT, etc.) and extracts the text before
> passing it on to cTAKES for NLP processing. Open source libraries such as
> Apache Tika and JavaOCR could serve as a starting point, but I have not
> found a single library that covers all of the above, including OCR, in one
> easy flow.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
