Re: [GSoC 2014]Optical Character Recognition project - Introduction

DImuthu Upeksha Mon, 03 Mar 2014 16:42:27 -0800

Hi John,

I tried to reuse that Android JNI wrapper for Tesseract. Here are my
observations:


1. This wrapper depends heavily on the Android image libraries
(android/bitmap.h). Most of the wrapper methods [1] use this library.

2. But I can understand the underlying logic in each function. Basically, it
maps Tesseract API functions [2] to Java methods. In between, it does some
image <=> byte array conversions using those Android bitmap libraries.

3. There are two options. 1: We can port its code to make it compatible with
our environments (Linux, Windows and Mac), which is really painful. It will
also cause memory leaks. 2: We can use only its function signatures and
implement them using our own code.

I think the 2nd option is better because we need only a few operations from
the Tesseract library. I have created a GitHub repo [3] for this. It's still
not finished. I need to add some makefiles and build files to make it run
properly. I also need to implement those wrapper functions [4]. This may
take some time.
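
As a rough illustration of option 2, the Java side of such a wrapper might
declare only a handful of native methods that mirror the Tesseract calls
used in [2]. The class name TessBaseApiSketch, the method names and the
parameter lists below are assumptions made for this sketch (loosely
following Tesseract's Init, SetImage, GetUTF8Text and End); they are not
taken from the repo in [3], and the C++ side still has to be written.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

/** Hypothetical Java face of a desktop JNI wrapper; no native library is loaded here. */
final class TessBaseApiSketch {

    // Opaque pointer to the native TessBaseAPI object, kept by the C++ side (assumed design).
    private long nativeHandle;

    /** Would call TessBaseAPI::Init with a tessdata path and a language code such as "eng". */
    native boolean init(String dataPath, String language);

    /** Would pass raw pixels to TessBaseAPI::SetImage instead of an Android Bitmap. */
    native void setImage(byte[] pixels, int width, int height, int bytesPerPixel, int bytesPerLine);

    /** Would return the result of TessBaseAPI::GetUTF8Text as a Java String. */
    native String getUtf8Text();

    /** Would call TessBaseAPI::End to release native resources. */
    native void end();

    /** Counts the methods declared native, i.e. those the C++ side must implement. */
    static int nativeMethodCount() {
        int count = 0;
        for (Method m : TessBaseApiSketch.class.getDeclaredMethods()) {
            if (Modifier.isNative(m.getModifiers())) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Only inspects the declarations; calling a native method without the
        // library loaded would throw UnsatisfiedLinkError.
        System.out.println("native methods to implement: " + nativeMethodCount());
    }
}
```

Taking a plain byte array plus dimensions in setImage is one way to avoid
the android/bitmap.h dependency mentioned in point 1.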

4. Because we are calling native libraries, we need different builds of the
Tesseract and Leptonica libraries for each platform (.dll for Windows, .so
for Linux, .dylib for Mac). So we may need to build those libraries at the
time we build the PDFBox project. Or we can pre-build those libraries and add
them to the project in .dll, .so or .dylib format. Which is the preferred
way?
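
To make the pre-built option in point 4 more concrete, here is a minimal
sketch of how Java code could pick the right file for the current platform.
The /native/<platform>/ resource layout and the base names "tesseract" and
"lept" are hypothetical placeholders, not an agreed PDFBox convention; real
library file names may differ per build.

```java
import java.util.Locale;

/** Sketch: choose a bundled native library by operating system (assumed layout). */
final class NativeLibraryNames {

    /** Maps an os.name value to a resource sub-directory. */
    static String platformDir(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        // Check Mac first: "darwin" also contains the substring "win".
        if (os.contains("mac") || os.contains("darwin")) {
            return "mac";
        }
        if (os.contains("win")) {
            return "windows";
        }
        return "linux";
    }

    /** Applies the usual per-platform prefix and extension to a base name. */
    static String fileName(String platformDir, String baseName) {
        switch (platformDir) {
            case "windows":
                return baseName + ".dll";
            case "mac":
                return "lib" + baseName + ".dylib";
            default:
                return "lib" + baseName + ".so";
        }
    }

    /** Builds the classpath location of a bundled library for the given OS. */
    static String resourcePath(String osName, String baseName) {
        String dir = platformDir(osName);
        return "/native/" + dir + "/" + fileName(dir, baseName);
    }

    public static void main(String[] args) {
        String os = System.getProperty("os.name");
        System.out.println(resourcePath(os, "tesseract"));
        System.out.println(resourcePath(os, "lept"));
    }
}
```

A library packed inside a jar cannot be loaded in place, so the resolved
resource would normally be copied to a temporary file and then passed to
System.load with its absolute path.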

[1]
https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
[2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
[3] https://github.com/DImuthuUpe/Tesseract-API
[4]
https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp

Thanks
Dimuthu


On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <[email protected]
> wrote:

> I made the necessary changes to the document [1].
>
> For the last two days I had a deep look at this [2] JNI wrapper for the
> Tesseract API.
> Unfortunately it has been designed for the Android environment, so I think we
> need to write our own makefiles to build it into a dll (Windows) or
> dylib (Mac). Currently it has Android.mk files [3]. I'm searching for a
> way to convert them to a makefile that we can run on the console. Please
> suggest a better approach if you have one.
>
> [1]
> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
> [2]
> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
> [3]
> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>
>
> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <[email protected]> wrote:
>
>> This is a good start. However, there is no need for the Adder component;
>> "Extracted Text (OCR)" can just feed back into the PDFBox "Text Extractor".
>>
>> Maybe show a "PDF" file feeding in to "Text Extractor", to make it clear
>> where the process starts.
>>
>> -- John
>>
>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <[email protected]>
>> wrote:
>>
>> > Sorry for the mistake. I added it to my Dropbox [1].
>> >
>> > [1]
>> >
>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>> >
>> > Thanks
>> > Dimuthu
>> >
>> >
>> > On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <[email protected]> wrote:
>> >
>> >> I should add that the OCR engine should be pluggable so PDFToText might
>> >> use an interface, e.g. OCREngine and there will be a TesseractOCREngine
>> >> class somewhere which provides the required functionality and lives in
>> a
>> >> separate jar file.
>> >>
>> >> -- John
>> >>
>> >>> On 25 Feb 2014, at 20:18, Dimuthu <[email protected]> wrote:
>> >>>
>> >>> So do you need to embed those new functionalities into existing
>> >> PDFtoText algorithms or package them as a new sub system(something
>> like an
>> >> API)?
>> >>>
>> >>> -----Original Message-----
>> >>> From: "John Hewson" <[email protected]>
>> >>> Sent: 26/02/2014 07:38
>> >>> To: "[email protected]" <[email protected]>
>> >>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>> >> Introduction
>> >>>
>> >>> Yes, exactly. By location data I just mean (x,y) coordinates and page
>> >> rotation.
>> >>>
>> >>> There is another use case for OCR: some fonts embedded in PDFs have
>> >> corrupt encodings, which means the ASCII codes map to the wrong
>> glyphs. We
>> >> could OCR the glyphs to repair the encoding.
>> >>>
>> >>> -- John
>> >>>
>> >>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>> [email protected]>
>> >> wrote:
>> >>>>
>> >>>> Hi John,
>> >>>> Thanks for the explanation.
>> >>>> Let's say there is a pdf with both text in extractable format and
>> some
>> >>>> images with text(Scanned images). In that case first we extract those
>> >>>> extractable content using PDFBox algorithms and rest is extracted
>> using
>> >>>> OCR. Finally we pack both results together and give output as
>> >> PDFToText. Am
>> >>>> I correct? What do you mean by "location data"?
>> >>>>
>> >>>> Thanks
>> >>>> Dimuthu
>> >>>>
>> >>>>
>> >>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <[email protected]>
>> >> wrote:
>> >>>>>
>> >>>>> 1. What is called "glyphs" ?
>> >>>>>
>> >>>>> http://en.wikipedia.org/wiki/Glyph
>> >>>>>
>> >>>>>> 2. What is the main requirement of this project?
>> >>>>>> As far as I understood, first we need to generate an image of
>> >>>>>> malformed pdfs from
>> >>>>>> PDFBox and then we need to do processing using OCR for further
>> >> accurate
>> >>>>>> results.  But the problem is, why shouldn't we directly do OCR on
>> >> those
>> >>>>>> PDFs without getting output from PDFBox? Correct me if I'm wrong.
>> >>>>>
>> >>>>> PDFBox can generate images (PDFToImage) and can extract text
>> >> (PDFToText).
>> >>>>> The goal of
>> >>>>> this project is to enhance PDFToText so that it can use OCR to
>> extract
>> >>>>> text from areas of the
>> >>>>> document where the text is embedded as an image. Such PDF files are
>> >>>>> typically generated by
>> >>>>> scanners or fax machines. There is also another case where OCR is
>> >> useful:
>> >>>>> some fonts embedded
>> >>>>> in PDF files contain the wrong encoding, so when text is extracted
>> with
>> >>>>> PDFToText the result is
>> >>>>> nonsense but when drawn with PDFToImage we see the correct letters.
>> >>>>>
>> >>>>> Instead of:
>> >>>>> PDF => Image => OCR => Text
>> >>>>>
>> >>>>> We want to do:
>> >>>>> PDF => (Many images for words + location data => OCR) => Text
>> >>>>>
>> >>>>> -- John
>> >>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>> >>>>> [email protected]
>> >>>>>>> wrote:
>> >>>>>>
>> >>>>>>> Ok fixed. This is what I did
>> >>>>>>> Right click on the new project ->Debug As-> Debug Configurations
>> >>>>> ->Source
>> >>>>>>> ->Add -> Project
>> >>>>>>> Then I selected PDFBox project.
>> >>>>>>>
>> >>>>>>> Thanks
>> >>>>>>> Dimuthu
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>> >>>>>>> [email protected]> wrote:
>> >>>>>>>
>> >>>>>>>> I'm using eclipse. This is what I want. I created a new Java
>> >>>>> application
>> >>>>>>>> project (say TestPDFBox) with a main class with following code.
>> >>>>>>>>
>> >>>>>>>> PDDocument document = new PDDocument();
>> >>>>>>>> PDPage blankPage = new PDPage();
>> >>>>>>>> document.addPage(blankPage);
>> >>>>>>>> document.save("BlankPage.pdf");
>> >>>>>>>> document.close();
>> >>>>>>>>
>> >>>>>>>> Then I need to add those jar files generated in target folder of
>> >> PDFBox
>> >>>>>>>> to build path of my new project (I did build the PDFBox project
>> from
>> >>>>>>>> source). That is what I did. But let's say I need to check  the
>> >>>>>>>> functionality of document.save("") method. But I don't have a
>> >>>>> reference to
>> >>>>>>>> it's sources because I directly used generated jars. As Tilman
>> said
>> >> I
>> >>>>> built
>> >>>>>>>> PDFBox from sources but I don't know a proper way to use it other
>> >>>>> projects
>> >>>>>>>> other than adding those jar files to build path.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <[email protected]>
>> >>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>> Which IDE are you using? You should be able to run the PDFToText
>> >> class
>> >>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as the
>> >>>>> command
>> >>>>>>>>> line argument.
>> >>>>>>>>>
>> >>>>>>>>> -- John
>> >>>>>>>>>
>> >>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>> >>>>> [email protected]>
>> >>>>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Hi John,
>> >>>>>>>>>> Thanks for the reply. Yes I checked out PDFBox code and
>> managed to
>> >>>>>>>>> build
>> >>>>>>>>>> code successfully. I looked at the classes you mentioned and I
>> >> got a
>> >>>>>>>>> rough
>> >>>>>>>>>> idea about how they are working. To check them I used the jars
>> in
>> >>>>>>>>> target
>> >>>>>>>>>> folder to my separate java project. I tried samples in
>> >>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to further look
>> into
>> >> code
>> >>>>>>>>>> specially how those processXXX() methods work in
>> PDFTextStripper
>> >>>>> class.
>> >>>>>>>>>> What I usually do is adding some breakpoints and checking them
>> in
>> >>>>> debug
>> >>>>>>>>>> windows. But using jars it's not possible. What is the way you
>> >> follow
>> >>>>>>>>> in
>> >>>>>>>>>> order to do such task?
>> >>>>>>>>>>
>> >>>>>>>>>> As well I installed tesseract in to my machine and managed to
>> do
>> >> some
>> >>>>>>>>> OCR
>> >>>>>>>>>> stuff also. That's a cool tool which works fine.
>> >>>>>>>>>> I'm still learning the code. If I get any issue I'll drop you a
>> >> mail.
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks
>> >>>>>>>>>> Dimuthu
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>> [email protected]
>> >>>
>> >>>>>>>>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>> Hi Dimuthu
>> >>>>>>>>>>>
>> >>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/. It
>> >>>>>>>>> contains
>> >>>>>>>>>>> a basic overview of the project
>> >>>>>>>>>>> and details on how to obtain the source code and build PDFBox
>> for
>> >>>>>>>>> yourself.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details
>> the
>> >> only
>> >>>>>>>>>>> thoughts so far regarding it.
>> >>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue are
>> all
>> >>>>> under
>> >>>>>>>>> the
>> >>>>>>>>>>> Apache license, which is a
>> >>>>>>>>>>> requirement.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Once you have the source code, take a look at the PageDrawer
>> >> class
>> >>>>> to
>> >>>>>>>>> see
>> >>>>>>>>>>> how text and images are
>> >>>>>>>>>>> rendered. We want someone to interface at a low-level (e.g.
>> one
>> >>>>> glyph,
>> >>>>>>>>>>> word, or sentence at a time) with
>> >>>>>>>>>>> an OCR engine. Also look at PDFTextStripper which is how text
>> is
>> >>>>>>>>> currently
>> >>>>>>>>>>> extracted, take a look at how
>> >>>>>>>>>>> we have to go to great length to sort text back into reading
>> >> order
>> >>>>> and
>> >>>>>>>>>>> infer the placement of diacritics - PDF
>> >>>>>>>>>>> is fundamentally a visual format, not a structured format like
>> >> HTML
>> >>>>> -
>> >>>>>>>>>>> which is why extracting text can be so
>> >>>>>>>>>>> difficult sometimes.
>> >>>>>>>>>>>
>> >>>>>>>>>>> The full PDF Reference document can be found at:
>> >>>>>
>> >>
>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>> >>>>>>>>>>>
>> >>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
>> >>>>> questions.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thanks,
>> >>>>>>>>>>>
>> >>>>>>>>>>> -- John
>> >>>>>>>>>>>
>> >>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>> >>>>> [email protected]
>> >>>>>>>>>>
>> >>>>>>>>>>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> Hi,
>> >>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering Undergraduate at
>> >>>>>>>>> University
>> >>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC 2013
>> with
>> >>>>>>>>> Apache
>> >>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and image
>> >>>>> processing
>> >>>>>>>>>>> stuff. So I would like to select this project idea as my GSoC
>> >> 2014
>> >>>>>>>>> project
>> >>>>>>>>>>> because I feel like it is the best suited project for me. In
>> >>>>>>>>> university
>> >>>>>>>>>>> also we have done some research in OCR area and our group
>> wrote a
>> >>>>>>>>>>> literature review about increasing efficiency of OCR
>> >>>>>>>>> systems(attached). Can
>> >>>>>>>>>>> you please suggest me where to start learning about PDFBox?
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> [1]
>> >>>>>
>> >>
>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Thank you
>> >>>>>>>>>>>> Dimuthu
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> --
>> >>>>>>>>>>>> Regards
>> >>>>>>>>>>>> W.Dimuthu Upeksha
>> >>>>>>>>>>>> Undergraduate
>> >>>>>>>>>>>> Department of Computer Science And Engineering
>> >>>>>>>>>>>> University of Moratuwa, Sri Lanka



-- 
Regards

W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka
