[jira] [Updated] (PDFBOX-1912) Optical Character Recognition (OCR)

John Hewson (JIRA) Tue, 11 Feb 2014 13:22:02 -0800

     [ 
https://issues.apache.org/jira/browse/PDFBOX-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


John Hewson updated PDFBOX-1912:
--------------------------------

    Description: 
Brief explanation: The PDFBox library is widely used to extract text from PDF 
files. However, many PDF files embed text in a malformed manner which renders 
text extraction useless. There has recently been interest in extracting 
governmental data from PDF files, the PDF Liberation commons being a notable 
example; see https://github.com/pdfliberation for more details.

Many end users of PDFBox have been using OCR tools such as Google's
Tesseract (https://code.google.com/p/tesseract-ocr/), which are run on the final
image generated by PDFBox. We think that a more integrated OCR API in PDFBox
could do a better job. PDFBox often has access to encoding and positioning
information for individual glyphs, so even when the extracted text is
meaningless, character-by-character or line-by-line OCR could be more accurate
than running OCR on the whole rendered page. PDFBox also has information such
as image orientation, which could allow it to perform OCR better on pages such
as embedded landscape tables.

There are existing JNI bindings for Tesseract available at 
https://code.google.com/p/tesseract-android-tools/

Expected results: To extend PDFBox with an API which allows external OCR tools
to be plugged in, and an implementation of a Tesseract plug-in using either JNI
or the command line via Runtime.exec.
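
To make the expected result more concrete, here is a minimal sketch of what
such a plug-in point and a command-line Tesseract implementation might look
like. It is only an illustration under stated assumptions: OcrEngine,
TesseractCliEngine and buildCommand are hypothetical names, not existing PDFBox
classes, and the sketch assumes a tesseract executable that accepts PNG input
and follows the usual "tesseract <image> <outputbase>" convention of writing
its result to <outputbase>.txt. It uses ProcessBuilder to launch the external
process.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.List;
import javax.imageio.ImageIO;

/** Hypothetical plug-in point; this is not an existing PDFBox interface. */
interface OcrEngine {
    /** Returns the text recognized in the given rendered image. */
    String recognize(BufferedImage image) throws IOException, InterruptedException;
}

/** Sketch of an engine that shells out to the tesseract command-line tool. */
class TesseractCliEngine implements OcrEngine {
    private final String executable;

    TesseractCliEngine(String executable) {
        this.executable = executable;
    }

    /** Builds "tesseract <image> <outputBase>"; the text lands in <outputBase>.txt. */
    static List<String> buildCommand(String executable, File image, File outputBase) {
        return Arrays.asList(executable, image.getPath(), outputBase.getPath());
    }

    @Override
    public String recognize(BufferedImage image) throws IOException, InterruptedException {
        File input = File.createTempFile("ocr-in", ".png");
        File outputBase = File.createTempFile("ocr-out", "");
        File outputText = new File(outputBase.getPath() + ".txt");
        try {
            // Hand the rendered image to the external tool through a temporary file.
            ImageIO.write(image, "png", input);
            Process process = new ProcessBuilder(buildCommand(executable, input, outputBase))
                    .inheritIO() // forward the tool's output so it cannot block on a full pipe
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return new String(Files.readAllBytes(outputText.toPath()), StandardCharsets.UTF_8);
        } finally {
            // Best-effort cleanup of the temporary files.
            input.delete();
            outputBase.delete();
            outputText.delete();
        }
    }
}
```

A caller would pass a rendered page, or a smaller region such as a single line
or glyph, to recognize(). A JNI-based engine could implement the same
hypothetical interface without any change to the calling code.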

Knowledge Prerequisite: Java (JNI a bonus)

Mentor: John Hewson

PMC Note: Tesseract is under the Apache License 2.0

  was:
Brief explanation: The PDFBox library is widely used to extract text from PDF 
files. However, many PDF files embed text in a malformed manner which renders 
text extraction useless. There has recently been interest in extracting 
governmental data from PDF files, the PDF Liberation commons being a notable 
example, see https://github.com/pdfliberation for more details.

Many end-users of PDFBox have been making use of OCR tools such as Google's 
Tesseract https://code.google.com/p/tesseract-ocr/ which are run on the final 
image generated by PDFBox. We think that by adding a more integrated OCR API to 
PDFBox it will be possible to do a better job. PDFBox often has access to 
encoding and positioning information for individual glyphs. Even when their 
extracted text is meaningless, a character-by-character, or line-by-line OCR 
could be more accurate. PDFBox also has information such as image orientation 
which could allow it to better perform OCR on pages such as embedded landscape 
tables.

Expected results: To extend PDF box with an API which allows external OCR tools 
to be plugged-in, and an implementation of a Tesseract plug-in using either JNI 
or the command line via Process.exec.

Knowledge Prerequisite: Java, (JNI a bonus)

Mentor: John Hewson

PMC Note: Tesseract  is under the Apache License 2.0


> Optical Character Recognition (OCR)
> -----------------------------------
>
>                 Key: PDFBOX-1912
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1912
>             Project: PDFBox
>          Issue Type: Wish
>          Components: Text extraction
>    Affects Versions: 2.0.0
>         Environment: JDK 6, C++
>            Reporter: John Hewson
>            Assignee: John Hewson
>              Labels: gsoc2014



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
