[ https://issues.apache.org/jira/browse/PDFBOX-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14620170#comment-14620170 ]
John Hewson edited comment on PDFBOX-1912 at 7/9/15 9:20 AM:
-------------------------------------------------------------

Yes, that's in. The plan is to integrate it into PDFBox 2.1.

You'll also need this: https://github.com/DImuthuUpe/Tesseract-API

was (Author: jahewson):
Yes, that's in. The plan is to integrate it into PDFBox 2.1.

> Optical Character Recognition (OCR)
> -----------------------------------
>
>                 Key: PDFBOX-1912
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1912
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Text extraction
>   Affects Versions: 2.0.0
>         Environment: JDK 6, C/C++
>            Reporter: John Hewson
>            Assignee: John Hewson
>              Labels: gsoc2014
>             Fix For: 2.1.0
>
>
> Brief explanation: The PDFBox library is widely used to extract text from PDF
> files. However, many PDF files embed text in a malformed manner which renders
> text extraction useless. There has recently been interest in extracting
> governmental data from PDF files, the PDF Liberation commons being a notable
> example, see https://github.com/pdfliberation for more details.
> Many end-users of PDFBox have been making use of OCR tools such as Google's
> Tesseract https://code.google.com/p/tesseract-ocr/ which are run on the final
> image generated by PDFBox. We think that by adding a more integrated OCR API
> to PDFBox it will be possible to do a better job. PDFBox often has access to
> encoding and positioning information for individual glyphs. Even when their
> extracted text is meaningless, a character-by-character, or line-by-line OCR
> could be more accurate. PDFBox also has information such as image orientation
> which could allow it to better perform OCR on pages such as embedded
> landscape tables.
> There are existing JNI bindings for Tesseract available at
> https://code.google.com/p/tesseract-android-tools/
> Expected results: To extend PDFBox with an API which allows external OCR
> tools to be plugged in, and an implementation of a Tesseract plug-in using
> either JNI or the command line via Runtime.exec.
> Knowledge Prerequisite: Java, JNI (C/C++)
> Mentor: John Hewson
> PMC Note: Tesseract is under the Apache License 2.0
> To learn more about PDFBox, please visit http://pdfbox.apache.org/

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
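
As a rough illustration of the "Expected results" item in the issue, here is a minimal sketch of what a pluggable OCR API with a command-line Tesseract plug-in might look like. Everything named here (OcrEngine, TesseractCliEngine, buildCommand) is hypothetical and is not part of PDFBox or of the linked Tesseract-API project. It assumes Java 7 or later, a tesseract binary on the PATH, and a page image that has already been rendered to a file. It uses ProcessBuilder to launch the external process.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/** Hypothetical plug-in point: any OCR tool that turns a page image into text. */
interface OcrEngine {
    String recognize(Path image) throws IOException, InterruptedException;
}

/** Hypothetical plug-in that shells out to the tesseract command-line tool. */
class TesseractCliEngine implements OcrEngine {
    private final String executable; // assumed: name or path of the tesseract binary
    private final String language;   // assumed: a Tesseract language code such as "eng"

    TesseractCliEngine(String executable, String language) {
        this.executable = executable;
        this.language = language;
    }

    /** Builds "tesseract <image> <outputBase> -l <lang>"; tesseract appends ".txt" to outputBase. */
    List<String> buildCommand(Path image, Path outputBase) {
        return Arrays.asList(executable, image.toString(), outputBase.toString(), "-l", language);
    }

    @Override
    public String recognize(Path image) throws IOException, InterruptedException {
        Path outputBase = Files.createTempFile("ocr", "");
        Path textFile = outputBase.resolveSibling(outputBase.getFileName() + ".txt");
        try {
            // inheritIO() forwards the tool's console output so the child cannot block on a full pipe.
            Process process = new ProcessBuilder(buildCommand(image, outputBase))
                    .inheritIO()
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        } finally {
            // Remove both the placeholder temp file and the text file tesseract wrote.
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(outputBase);
        }
    }
}
```

A caller would construct new TesseractCliEngine("tesseract", "eng") and pass it the path of a rendered page image. Keeping the interface this small means a JNI-based engine could be swapped in without touching calling code. The per-glyph or per-line OCR mentioned in the issue would need a richer interface than this sketch offers.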