Re: [GSoC 2014]Optical Character Recognition project - Introduction

John Hewson Tue, 04 Mar 2014 11:46:39 -0800

Hi Dimuthu,

1,2,3:


Feel free to write your own Tesseract binding or port the existing code
as you see fit. The JNI binding should be minimal: only the methods you
require need to be wrapped. Also, don’t forget that some of the interop
can be done in Java. For example, if it is easier to convert a
BufferedImage to a byte array in Java, then do it there and pass the
result to JNI rather than writing lots of JNI C++ to achieve the same
result.
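
As a rough illustration of doing that conversion on the Java side, here is a minimal sketch. It assumes 8-bit grayscale is an acceptable input format for the native side; the class and method names (ImageBytes, toGrayBytes) are hypothetical and not part of PDFBox or any existing wrapper.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;

/** Hypothetical helper: flattens any BufferedImage to 8-bit grayscale bytes. */
final class ImageBytes {
    private ImageBytes() {}

    /**
     * Returns width * height bytes, one byte per pixel, in row-major order.
     * The grayscale target format is an assumption made for this sketch.
     */
    static byte[] toGrayBytes(BufferedImage src) {
        // Redraw the source into a fresh grayscale image so the pixel layout
        // is known regardless of the source image type.
        BufferedImage gray = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        // A freshly created TYPE_BYTE_GRAY image is backed by a byte buffer
        // with one byte per pixel and no row padding.
        byte[] data = ((DataBufferByte) gray.getRaster().getDataBuffer()).getData();
        return data.clone(); // copy so callers cannot alter the image
    }
}
```

The array, together with the width and height, could then be handed to a single native method. As far as I know, TessBaseAPI::SetImage has an overload that takes raw pixel data plus bytes-per-pixel and bytes-per-line, which for this layout would be 1 and the image width.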

Your GitHub repo looks like a good start. I can make comments there as
things progress.

Is it possible to build Tesseract without leptonica? I was under the
impression that it was used for image I/O only, but I may be misinformed.

4: The native platform library should be built as part of the Maven
build for the Tesseract wrapper, which can be a separate project. The
output can be a jar file which contains the native binaries. It should
be possible for the jar to contain prebuilt binaries for all platforms,
but this is something we can worry about later. Right now the goal
should be to build a jar containing just the current platform’s native
binary and any Java wrapper code.
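
One possible way for the Java wrapper to load a native binary packaged inside the jar is sketched below. The resource layout (/native/<platform>/<file>), the class name NativeLoader and the library base name used in the note afterwards are all assumptions for illustration, not an agreed design.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Locale;

/** Hypothetical loader for a native library bundled as a jar resource. */
final class NativeLoader {
    private NativeLoader() {}

    /** Maps an os.name value to the directory name assumed inside the jar. */
    static String platformDir(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        // Check for Mac first: "darwin" also contains the substring "win".
        if (os.contains("mac") || os.contains("darwin")) return "macosx";
        if (os.contains("win")) return "windows";
        return "linux";
    }

    /** Builds the resource path under the assumed /native/<platform>/ layout. */
    static String resourcePath(String osName, String libFileName) {
        return "/native/" + platformDir(osName) + "/" + libFileName;
    }

    /** Copies the bundled binary to a temporary file and loads it. */
    static void load(String baseName) throws IOException {
        // mapLibraryName turns "foo" into foo.dll, libfoo.so or libfoo.dylib.
        String fileName = System.mapLibraryName(baseName);
        String resource = resourcePath(System.getProperty("os.name"), fileName);
        try (InputStream in = NativeLoader.class.getResourceAsStream(resource)) {
            if (in == null) {
                throw new IOException("No native binary in jar: " + resource);
            }
            Path tmp = Files.createTempFile("native-", "-" + fileName);
            tmp.toFile().deleteOnExit();
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            System.load(tmp.toAbsolutePath().toString());
        }
    }
}
```

A wrapper class could call NativeLoader.load("tessjni") from a static initializer, where "tessjni" is a placeholder name. With only the current platform's binary in the jar, the lookup fails with a clear error on other platforms, and prebuilt binaries for those can be added later without changing the Java code.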

-- John

On 3 Mar 2014, at 16:41, DImuthu Upeksha <[email protected]> wrote:

> Hi John,
> 
> I tried to reuse that Android JNI wrapper for Tesseract. Here are my
> observations:
> 
> 1. This wrapper depends heavily on the Android image libraries
> (android/bitmap.h). Most of the wrapper methods [1] use this library.
> 
> 2. But I can understand the underlying logic in each function. Basically,
> it maps Tesseract API functions [2] to Java methods. In between, it does
> some image <=> byte array conversions using the Android bitmap
> libraries.
> 
> 3. There are two ways. 1: We can port its code to make it compatible
> with our environments (Linux, Windows and Mac), which is really painful.
> It will also cause memory leaks. 2: We can use only its function
> signatures and implement them using our own code.
> 
> I think the 2nd solution is better because we need only a few operations
> to be done using the Tesseract library. I have created a GitHub repo [3]
> for this. It's still not finished. I need to add some make files and
> build files to make it run properly. I also need to implement those
> wrapper functions [4]. This may take some time.
> 
> 4. Because we are calling native libraries we need different builds of
> tesseract and leptonica libraries for each platform (dll for windows, so
> for linux, dylib for mac). So we may need to build those libraries at the
> time we build pdfbox project. Or we can pre build those libraries and add
> them to the project as .dll, .so or .dylib format. What is the preferred
> way?
> 
> [1]
> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
> [3] https://github.com/DImuthuUpe/Tesseract-API
> [4]
> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
> 
> Thanks
> Dimuthu
> 
> 
> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <[email protected]> wrote:
> 
>> I made the necessary changes to the document [1].
>> 
>> For the last two days I had a deep look at this [2] JNI wrapper for the
>> Tesseract API. Unfortunately it has been designed for the Android
>> environment, so I think we need to write our own make files to build it
>> into a dll (Windows) or dylib (Mac). Currently it has Android.mk files
>> [3]. I'm searching for a way to convert it to a make file that we can
>> run on the console. Please suggest if you have a better approach.
>> 
>> [1]
>> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>> [2]
>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>> [3]
>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>> 
>> 
>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <[email protected]> wrote:
>> 
>>> This is a good start. However, there is no need for the Adder component:
>>> "Extracted Text (OCR)" can just feed back into the PDFBox "Text Extractor".
>>> 
>>> Maybe show a "PDF" file feeding in to "Text Extractor", to make it clear
>>> where the process starts.
>>> 
>>> -- John
>>> 
>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <[email protected]>
>>> wrote:
>>> 
>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>> 
>>>> [1]
>>>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>> 
>>>> Thanks
>>>> Dimuthu
>>>> 
>>>> 
>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <[email protected]> wrote:
>>>> 
>>>>> I should add that the OCR engine should be pluggable, so PDFToText might
>>>>> use an interface, e.g. OCREngine, and there will be a TesseractOCREngine
>>>>> class somewhere which provides the required functionality and lives in a
>>>>> separate jar file.
>>>>> 
>>>>> -- John
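
A minimal sketch of what such a pluggable design might look like follows. OCREngine and TesseractOCREngine are the names suggested in the message above. The method signature, the stand-in FixedTextOCREngine and the caller ImageTextExtractor are assumptions made purely for illustration.

```java
import java.awt.image.BufferedImage;

/** Pluggable OCR abstraction; the method shape is an assumption. */
interface OCREngine {
    /** Returns the text recognised in the given image region. */
    String recognize(BufferedImage image);
}

/**
 * Stand-in engine used only to show the wiring. A real TesseractOCREngine
 * would live in its own jar and call the JNI binding instead.
 */
final class FixedTextOCREngine implements OCREngine {
    private final String text;

    FixedTextOCREngine(String text) {
        this.text = text;
    }

    @Override
    public String recognize(BufferedImage image) {
        return text;
    }
}

/** Hypothetical caller standing in for the part of PDFToText that needs OCR. */
final class ImageTextExtractor {
    private final OCREngine engine;

    ImageTextExtractor(OCREngine engine) {
        if (engine == null) {
            throw new IllegalArgumentException("engine must not be null");
        }
        this.engine = engine;
    }

    /** Delegates to whichever engine was plugged in and tidies the result. */
    String extract(BufferedImage region) {
        String text = engine.recognize(region);
        return text == null ? "" : text.trim();
    }
}
```

Because the caller depends only on the interface, the Tesseract-specific jar and its native binaries stay optional, and a different engine can be swapped in without touching the text extraction code.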
>>>>> 
>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <[email protected]> wrote:
>>>>>> 
>>>>>> So do you need to embed those new functionalities into the existing
>>>>>> PDFToText algorithms or package them as a new subsystem (something like
>>>>>> an API)?
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: "John Hewson" <[email protected]>
>>>>>> Sent: 26/02/2014 07:38
>>>>>> To: "[email protected]" <[email protected]>
>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>>>> Introduction
>>>>>> 
>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and page
>>>>>> rotation.
>>>>>> 
>>>>>> There is another use case for OCR: some fonts embedded in PDFs have
>>>>>> corrupt encodings, which means the ASCII codes map to the wrong glyphs.
>>>>>> We could OCR the glyphs to repair the encoding.
>>>>>> 
>>>>>> -- John
>>>>>> 
>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <[email protected]> wrote:
>>>>>>> 
>>>>>>> Hi John,
>>>>>>> Thanks for the explanation.
>>>>>>> Let's say there is a PDF with both text in extractable format and some
>>>>>>> images with text (scanned images). In that case, first we extract the
>>>>>>> extractable content using PDFBox algorithms and the rest is extracted
>>>>>>> using OCR. Finally we pack both results together and give the output as
>>>>>>> PDFToText. Am I correct? What do you mean by "location data"?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Dimuthu
>>>>>>> 
>>>>>>> 
>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> 1. What is called "glyphs"?
>>>>>>>> 
>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>> 
>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>>> malformed PDFs from PDFBox and then we need to process it using OCR
>>>>>>>>> for more accurate results. But why shouldn't we directly do OCR on
>>>>>>>>> those PDFs without getting output from PDFBox? Correct me if I'm
>>>>>>>>> wrong.
>>>>>>>> 
>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
>>>>>>>> (PDFToText). The goal of this project is to enhance PDFToText so that
>>>>>>>> it can use OCR to extract text from areas of the document where the
>>>>>>>> text is embedded as an image. Such PDF files are typically generated
>>>>>>>> by scanners or fax machines. There is also another case where OCR is
>>>>>>>> useful: some fonts embedded in PDF files contain the wrong encoding,
>>>>>>>> so when text is extracted with PDFToText the result is nonsense, but
>>>>>>>> when drawn with PDFToImage we see the correct letters.
>>>>>>>> 
>>>>>>>> Instead of:
>>>>>>>> PDF => Image => OCR => Text
>>>>>>>> 
>>>>>>>> We want to do:
>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>> 
>>>>>>>> -- John
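
To make the role of the location data more concrete, here is a small sketch of how OCR results could be put back into reading order using their (x,y) coordinates. It assumes a top-left origin with y increasing downwards and exact y values shared by fragments on the same line. LocatedText and TextMerger are hypothetical names. This is not how PDFTextStripper sorts text, only an illustration of the idea.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** A piece of recognised text together with its location data. */
final class LocatedText {
    final double x;
    final double y;
    final int rotationDegrees; // page rotation, carried along but unused below
    final String text;

    LocatedText(double x, double y, int rotationDegrees, String text) {
        this.x = x;
        this.y = y;
        this.rotationDegrees = rotationDegrees;
        this.text = text;
    }
}

/** Hypothetical helper that merges located fragments into one string. */
final class TextMerger {
    private TextMerger() {}

    /** Sorts fragments top-to-bottom, then left-to-right, and joins them with spaces. */
    static String inReadingOrder(List<LocatedText> fragments) {
        List<LocatedText> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator
                .comparingDouble((LocatedText f) -> f.y)
                .thenComparingDouble(f -> f.x));
        StringBuilder out = new StringBuilder();
        for (LocatedText f : sorted) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(f.text);
        }
        return out.toString();
    }
}
```

A real implementation would need a tolerance when grouping fragments into lines and would have to account for rotation, but the same record could hold both OCR output and normally extracted text so that the two are merged in one pass.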
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>> Ok, fixed. This is what I did:
>>>>>>>>>> Right click on the new project -> Debug As -> Debug Configurations
>>>>>>>>>> -> Source -> Add -> Project
>>>>>>>>>> Then I selected the PDFBox project.
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> Dimuthu
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>>> I'm using Eclipse. This is what I want. I created a new Java
>>>>>>>>>>> application project (say TestPDFBox) with a main class containing the
>>>>>>>>>>> following code.
>>>>>>>>>>> 
>>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>>> document.addPage(blankPage);
>>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>>> document.close();
>>>>>>>>>>> 
>>>>>>>>>>> Then I need to add the jar files generated in the target folder of
>>>>>>>>>>> PDFBox to the build path of my new project (I built the PDFBox project
>>>>>>>>>>> from source). That is what I did. But let's say I need to check the
>>>>>>>>>>> functionality of the document.save("") method. I don't have a reference
>>>>>>>>>>> to its sources because I used the generated jars directly. As Tilman
>>>>>>>>>>> said, I built PDFBox from sources, but I don't know a proper way to use
>>>>>>>>>>> it in other projects other than adding those jar files to the build
>>>>>>>>>>> path.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Which IDE are you using? You should be able to run the PDFToText class
>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as the
>>>>>>>>>>>> command line argument.
>>>>>>>>>>>> 
>>>>>>>>>>>> -- John
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and managed
>>>>>>>>>>>>> to build it successfully. I looked at the classes you mentioned and
>>>>>>>>>>>>> I got a rough idea of how they work. To check them I added the jars
>>>>>>>>>>>>> in the target folder to my separate Java project. I tried the samples
>>>>>>>>>>>>> in http://pdfbox.apache.org/cookbook/. I need to look further into
>>>>>>>>>>>>> the code, especially how those processXXX() methods work in the
>>>>>>>>>>>>> PDFTextStripper class. What I usually do is add some breakpoints and
>>>>>>>>>>>>> check them in the debug windows. But using jars it's not possible.
>>>>>>>>>>>>> What approach do you follow for such a task?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I also installed Tesseract on my machine and managed to do some OCR.
>>>>>>>>>>>>> That's a cool tool which works fine.
>>>>>>>>>>>>> I'm still learning the code. If I run into any issue I'll drop you a
>>>>>>>>>>>>> mail.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <[email protected]> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/ and
>>>>>>>>>>>>>> contains a basic overview of the project and details on how to
>>>>>>>>>>>>>> obtain the source code and build PDFBox for yourself.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Currently we do not perform any OCR, and PDFBOX-1912 details the
>>>>>>>>>>>>>> only thoughts so far regarding it. Note that the OCR libraries
>>>>>>>>>>>>>> mentioned in the JIRA issue are all under the Apache license, which
>>>>>>>>>>>>>> is a requirement.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Once you have the source code, take a look at the PageDrawer class
>>>>>>>>>>>>>> to see how text and images are rendered. We want someone to
>>>>>>>>>>>>>> interface at a low level (e.g. one glyph, word, or sentence at a
>>>>>>>>>>>>>> time) with an OCR engine. Also look at PDFTextStripper, which is how
>>>>>>>>>>>>>> text is currently extracted. Take a look at how we have to go to
>>>>>>>>>>>>>> great lengths to sort text back into reading order and infer the
>>>>>>>>>>>>>> placement of diacritics. PDF is fundamentally a visual format, not
>>>>>>>>>>>>>> a structured format like HTML, which is why extracting text can be
>>>>>>>>>>>>>> so difficult sometimes.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
>>>>>>>> questions.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka. I successfully completed GSoC
>>>>>>>>>>>>>>> 2013 with the Apache ISIS [1] project. I'm very much interested in
>>>>>>>>>>>>>>> OCR and image processing, so I would like to select this project
>>>>>>>>>>>>>>> idea as my GSoC 2014 project because I feel it is the best suited
>>>>>>>>>>>>>>> project for me. At university we have also done some research in
>>>>>>>>>>>>>>> the OCR area, and our group wrote a literature review about
>>>>>>>>>>>>>>> increasing the efficiency of OCR systems (attached). Can you please
>>>>>>>>>>>>>>> suggest where to start learning about PDFBox?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
