John Hewson created PDFBOX-1912:
-----------------------------------
Summary: Optical Character Recognition (OCR)
Key: PDFBOX-1912
URL: https://issues.apache.org/jira/browse/PDFBOX-1912
Project: PDFBox
Issue Type: Wish
Components: Text extraction
Affects Versions: 2.0.0
Environment: JDK 6, C++
Reporter: John Hewson
Assignee: John Hewson
Brief explanation: The PDFBox library is widely used to extract text from PDF
files. However, many PDF files embed text in a malformed manner, which makes
the extracted text useless. There has recently been interest in extracting
governmental data from PDF files; the PDF Liberation commons is a notable
example (see https://github.com/pdfliberation for more details).
Many end-users of PDFBox have been making use of OCR tools such as Google's
Tesseract (https://code.google.com/p/tesseract-ocr/), which are run on the final
image generated by PDFBox. We think that a more integrated OCR API in PDFBox
would make it possible to do a better job. PDFBox often has access to encoding
and positioning information for individual glyphs, even when the text extracted
from them is meaningless, so character-by-character or line-by-line OCR could
be more accurate. PDFBox also has information such as image orientation, which
could allow it to better perform OCR on content such as embedded landscape
tables.
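As a rough illustration of how glyph positioning information could drive
per-character OCR, the sketch below maps a glyph's bounding box from PDF user
space (points, origin at the bottom-left) to a pixel rectangle in a rendered
page image (origin at the top-left). The class and method names are
hypothetical and are not part of PDFBox. The sketch assumes an unrotated page
whose visible area starts at (0, 0).

```java
import java.awt.Rectangle;
import java.awt.geom.Rectangle2D;

/** Hypothetical helper, not a PDFBox class. */
final class GlyphRegions {
    private static final double POINTS_PER_INCH = 72.0;

    private GlyphRegions() {
    }

    /**
     * Converts a glyph box given in PDF points (y axis pointing up) into a
     * pixel rectangle (y axis pointing down) for an image rendered at the
     * given resolution. Assumes no page rotation and a page origin of (0, 0).
     */
    static Rectangle toPixelRect(Rectangle2D glyphBox, double pageHeightPts, double dpi) {
        double scale = dpi / POINTS_PER_INCH;
        // Round outwards so the crop never cuts into the glyph.
        int left = (int) Math.floor(glyphBox.getMinX() * scale);
        int top = (int) Math.floor((pageHeightPts - glyphBox.getMaxY()) * scale);
        int right = (int) Math.ceil(glyphBox.getMaxX() * scale);
        int bottom = (int) Math.ceil((pageHeightPts - glyphBox.getMinY()) * scale);
        return new Rectangle(left, top, right - left, bottom - top);
    }
}
```

The resulting rectangle could be passed to BufferedImage.getSubimage to crop
the region handed to the OCR tool. A real implementation would also need to
account for page rotation and crop boxes.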
Expected results: An extension to PDFBox with an API which allows external OCR
tools to be plugged in, and an implementation of a Tesseract plug-in using
either JNI or the command line via Runtime.exec.
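One possible shape for such a plug-in API, using the command-line variant, is
sketched below. The OcrEngine interface and TesseractCliEngine class are
hypothetical names invented for illustration; no such API exists in PDFBox
yet. The sketch assumes Java 7 or later, a tesseract executable on the PATH,
and the classic invocation form "tesseract <image> <outputbase> -l <lang>",
which writes its result to <outputbase>.txt.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical plug-in point; not an existing PDFBox interface. */
interface OcrEngine {
    String recognize(File image, String language) throws IOException;
}

/** Sketch of an engine that shells out to the tesseract command-line tool. */
final class TesseractCliEngine implements OcrEngine {
    private final String executable;

    TesseractCliEngine(String executable) {
        this.executable = executable;
    }

    /** Builds: <executable> <image> <outputBase> -l <language> */
    static List<String> buildCommand(String executable, File image,
                                     File outputBase, String language) {
        List<String> command = new ArrayList<String>();
        command.add(executable);
        command.add(image.getPath());
        command.add(outputBase.getPath());
        command.add("-l");
        command.add(language);
        return command;
    }

    @Override
    public String recognize(File image, String language) throws IOException {
        // An empty suffix keeps the temp file name usable as the output base.
        File outputBase = File.createTempFile("pdfbox-ocr", "");
        File textFile = new File(outputBase.getPath() + ".txt");
        try {
            Process process = new ProcessBuilder(
                    buildCommand(executable, image, outputBase, language))
                    .inheritIO() // avoid blocking on unread output
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return new String(Files.readAllBytes(textFile.toPath()),
                    StandardCharsets.UTF_8);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("interrupted while waiting for tesseract", e);
        } finally {
            textFile.delete();
            outputBase.delete();
        }
    }
}
```

A caller might write new TesseractCliEngine("tesseract").recognize(pageImage,
"eng"). Keeping callers dependent only on the interface would let a JNI-based
engine replace the command-line one without further changes.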
Knowledge Prerequisite: Java (JNI a bonus)
Mentor: John Hewson
PMC Note: Tesseract is under the Apache License 2.0