[jira] [Comment Edited] (PDFBOX-1912) Optical Character Recognition (OCR)

John Hewson (JIRA) Thu, 09 Jul 2015 02:22:07 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14620170#comment-14620170 ]


John Hewson edited comment on PDFBOX-1912 at 7/9/15 9:20 AM:
-------------------------------------------------------------

Yes, that's in. The plan is to integrate it into PDFBox 2.1. You'll also need 
this: https://github.com/DImuthuUpe/Tesseract-API


was (Author: jahewson):
Yes, that's in. The plan is to integrate it into PDFBox 2.1.

> Optical Character Recognition (OCR)
> -----------------------------------
>
>                 Key: PDFBOX-1912
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1912
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Text extraction
>    Affects Versions: 2.0.0
>         Environment: JDK 6, C/C++
>            Reporter: John Hewson
>            Assignee: John Hewson
>              Labels: gsoc2014
>             Fix For: 2.1.0
>
>
> Brief explanation: The PDFBox library is widely used to extract text from PDF 
> files. However, many PDF files embed text in a malformed manner which renders 
> text extraction useless. There has recently been interest in extracting 
> governmental data from PDF files, the PDF Liberation commons being a notable 
> example; see https://github.com/pdfliberation for more details.
> Many end-users of PDFBox have been making use of OCR tools such as Google's 
> Tesseract https://code.google.com/p/tesseract-ocr/ which are run on the final 
> image generated by PDFBox. We think that by adding a more integrated OCR API 
> to PDFBox it will be possible to do a better job. PDFBox often has access to 
> encoding and positioning information for individual glyphs. Even when the 
> text extracted from those glyphs is meaningless, character-by-character or 
> line-by-line OCR could be more accurate. PDFBox also has information such as image orientation 
> which could allow it to better perform OCR on pages such as embedded 
> landscape tables.
> There are existing JNI bindings for Tesseract available at 
> https://code.google.com/p/tesseract-android-tools/
> Expected results: To extend PDFBox with an API which allows external OCR 
> tools to be plugged in, and an implementation of a Tesseract plug-in using 
> either JNI or the command line via Process.exec.
> Knowledge Prerequisite: Java, JNI (C/C++)
> Mentor: John Hewson
> PMC Note: Tesseract is under the Apache License 2.0
> To learn more about PDFBox, please visit http://pdfbox.apache.org/
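
The "expected results" in the description above (a pluggable OCR API plus a
Tesseract plug-in driven from the command line) could look roughly like the
sketch below. This is only an illustration, not the PDFBox design: the names
OcrEngine, TesseractCliEngine and buildCommand are hypothetical, and it
assumes a tesseract executable that accepts "stdout" as its output base and
the "-l" language option.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/** Hypothetical plug-in point for external OCR tools; not a PDFBox type. */
interface OcrEngine {
    /** Returns the text recognised in a rendered page image. */
    String recognize(Path image) throws IOException, InterruptedException;
}

/** Hypothetical engine that shells out to the tesseract command line. */
class TesseractCliEngine implements OcrEngine {
    private final String executable; // e.g. "tesseract" if it is on the PATH
    private final String language;   // e.g. "eng"

    TesseractCliEngine(String executable, String language) {
        this.executable = executable;
        this.language = language;
    }

    /** Builds the command line: tesseract <image> stdout -l <language>. */
    List<String> buildCommand(Path image) {
        return Arrays.asList(executable, image.toString(), "stdout", "-l", language);
    }

    @Override
    public String recognize(Path image) throws IOException, InterruptedException {
        // Standard error is passed through so it cannot fill up and block the child.
        Process process = new ProcessBuilder(buildCommand(image))
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        String text;
        try (InputStream out = process.getInputStream()) {
            text = new String(readAll(out), StandardCharsets.UTF_8);
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("OCR process exited with code " + exitCode);
        }
        return text;
    }

    /** Reads a stream to the end (kept manual so the sketch runs on Java 7/8). */
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        return buffer.toByteArray();
    }
}
```

A caller would render a page to an image file with PDFBox and pass that path
to recognize(). Keeping the interface separate from the Tesseract class is
what would let a JNI-based engine be swapped in later without touching
callers.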



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

