Re: [GSoC 2014]Optical Character Recognition project - Introduction

DImuthu Upeksha Wed, 05 Mar 2014 09:46:42 -0800

Hi John,
+1 for your suggestion about converting image <=> byte array on the Java
side. It removes a lot of complexity. I don't know whether you have noticed,
but the jint data type in JNI is a 32-bit integer type. I noticed it on my
Mac, and as far as I know the JNI specification fixes it at 32 bits on every
platform. There are a lot of code fragments in the current Android JNI
wrapper which use "(jint)somePointer" casts. These will cause terrible
memory errors in 64-bit environments, because pointers there are 64 bits
wide and get truncated. So I believe writing it from the beginning is much
better.
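
To make the jint problem concrete, here is a minimal sketch in plain Java (no
JNI involved). The class name, the helper methods and the address value are
made up for illustration; the point is only that a 64-bit value pushed through
a 32-bit int does not come back intact, while a long (jlong on the native
side) keeps it.

```java
/** Sketch: why a native pointer should travel through Java as a long, not an int. */
final class NativeHandleDemo {

    /** Mimics the "(jint)somePointer" cast: only the low 32 bits survive. */
    static int toJint(long pointer) {
        return (int) pointer;
    }

    /** Mimics casting the jint back to a 64-bit pointer (sign-extending). */
    static long fromJint(int handle) {
        return handle;
    }

    public static void main(String[] args) {
        // Hypothetical 64-bit address, chosen only for illustration.
        long pointer = 0x00007FFF5FBFF8A0L;

        long roundTripped = fromJint(toJint(pointer));
        System.out.printf("original:     0x%016X%n", pointer);
        System.out.printf("through jint: 0x%016X%n", roundTripped);
        System.out.println("survives jint:  " + (pointer == roundTripped));

        // A Java long is 64 bits wide, so the same value survives unchanged.
        long throughJlong = pointer;
        System.out.println("survives jlong: " + (pointer == throughJlong));
    }
}
```

So in a rewritten wrapper the native handle would probably be kept in a long
field on the Java side and passed to the native methods as jlong.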


Leptonica is the image processing library for Tesseract [1]. Tesseract uses
the image processing algorithms in Leptonica to implement its OCR
algorithms. This [2] is the .cpp file responsible for creating the Tesseract
API. You can see it includes the allheaders.h header file, which is the main
header file of Leptonica. So I think it is a must to build Leptonica first
and link it when we build Tesseract. This is not a big problem if we can use
the static library of Leptonica (I did and it worked nicely). I think it is
not an issue to use its static library because Tesseract is under the Apache
licence and Leptonica, as far as I can tell, is under a permissive BSD-style
licence.

I'm working on the Maven build you mentioned and will get back to you soon.

Thanks
Dimuthu


[1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
[2]
https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp


On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <[email protected]> wrote:

> Hi Dimuthu,
>
> 1,2,3:
>
> Feel free to write your own Tesseract binding or port the existing code as
> you see fit. The JNI binding should be minimal, only the methods you require
> need to be wrapped. Also, don't forget that some of the interop can be done
> in Java, for example if it is easier to convert a BufferedImage to a byte
> array in Java then do it there and pass the result to JNI rather than
> writing lots of JNI C++ to achieve the same result.
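
A minimal sketch of the Java-side conversion described above, assuming the
native side wants tightly packed 8-bit grayscale pixels (one byte per pixel,
rows stored one after another). The class and method names are hypothetical,
and the grayscale layout is an assumption, not something the thread has
settled on.

```java
import java.awt.image.BufferedImage;

/** Sketch: flatten a BufferedImage into a byte array before crossing into JNI. */
final class ImageBytes {

    /** Returns width * height bytes of 8-bit grayscale, row by row. */
    static byte[] toGrayBytes(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        byte[] pixels = new byte[width * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the usual luma weights (0.299, 0.587, 0.114).
                pixels[y * width + x] = (byte) ((299 * r + 587 * g + 114 * b) / 1000);
            }
        }
        return pixels;
    }
}
```

The resulting array, together with the width and height, could then be handed
to a single native method, so the C++ side never needs to know about
BufferedImage at all.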
>
> Your GitHub repo looks like a good start, I can make comments there as
> things progress.
>
> Is it possible to build Tesseract without Leptonica? I was under the
> impression that it was used for image i/o only, but I may be misinformed.
>
> 4: The native platform library should be built as part of the Maven build
> for the Tesseract wrapper which can be a separate project. The output can be
> a jar file which contains the native binaries. It should be possible for the
> jar to contain prebuilt binaries for all platforms but this is something we
> can worry about later. Right now the goal should be to build a jar
> containing just the current platform's native binary and any Java wrapper
> code.
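
One way the Java wrapper could pick up a native binary packaged inside the
jar is sketched below. This is only an illustration under assumptions: the
"/native/" resource folder, the class name and the library base name are
hypothetical, and nothing here has been agreed for the actual build. It copies
the bundled library to a temporary file because System.load needs a real file
path.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Locale;

/** Sketch: load a platform-specific native library that ships inside the jar. */
final class NativeLoader {

    /** Maps an os.name value to the platform's library file name. */
    static String libraryFileName(String osName, String baseName) {
        String os = osName.toLowerCase(Locale.ROOT);
        // Check Mac first, since "darwin" also contains "win".
        if (os.contains("mac") || os.contains("darwin")) {
            return "lib" + baseName + ".dylib";
        }
        if (os.contains("win")) {
            return baseName + ".dll";
        }
        return "lib" + baseName + ".so";
    }

    /** Extracts the bundled library to a temp file and loads it. */
    static void load(String baseName) throws IOException {
        String fileName = libraryFileName(System.getProperty("os.name"), baseName);
        String resource = "/native/" + fileName; // hypothetical location inside the jar
        try (InputStream in = NativeLoader.class.getResourceAsStream(resource)) {
            if (in == null) {
                throw new IOException("No native library bundled at " + resource);
            }
            Path temp = Files.createTempFile("native-", "-" + fileName);
            temp.toFile().deleteOnExit();
            Files.copy(in, temp, StandardCopyOption.REPLACE_EXISTING);
            System.load(temp.toAbsolutePath().toString());
        }
    }
}
```

With a jar that only contains the current platform's binary, the same code
still works; it simply fails with an IOException on platforms whose binary is
missing.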
>
> -- John
>
> On 3 Mar 2014, at 16:41, DImuthu Upeksha <[email protected]>
> wrote:
>
> > Hi John,
> >
> > I tried to reuse that Android JNI wrapper for Tesseract. Here are my
> > observations:
> >
> > 1. This wrapper depends heavily on Android image libraries
> > (android/bitmap.h). Most of the wrapper methods [1] use this library.
> >
> > 2. But I can understand the underlying logic in each function. Basically
> > it maps Tesseract API functions [2] to Java methods. In between it does
> > some image <=> byte array conversions using those Android bitmap
> > libraries.
> >
> > 3. There are two ways. 1: We can port its code to make it compatible with
> > our environments (Linux, Windows and Mac), which is really painful. Also it
> > will cause memory leaks. 2: We can use only its function signatures and
> > implement them using our own code.
> >
> > I think the 2nd solution is better because we need only a few operations
> > from the Tesseract library. I have created a GitHub repo [3] for this.
> > It's still not finished. I need to add some makefiles and build files to
> > make it run properly. I also need to implement those wrapper functions
> > [4]. This may take some time.
> >
> > 4. Because we are calling native libraries we need different builds of
> > the Tesseract and Leptonica libraries for each platform (dll for Windows,
> > so for Linux, dylib for Mac). So we may need to build those libraries at
> > the time we build the PDFBox project. Or we can prebuild those libraries
> > and add them to the project in .dll, .so or .dylib format. What is the
> > preferred way?
> >
> > [1]
> > https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
> > [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
> > [3] https://github.com/DImuthuUpe/Tesseract-API
> > [4]
> > https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
> >
> > Thanks
> > Dimuthu
> >
> >
> > On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
> [email protected]
> >> wrote:
> >
> >> I made the necessary changes to the document [1].
> >>
> >> For the last two days I had a deep look at this [2] JNI wrapper for the
> >> Tesseract API. Unfortunately it has been designed for the Android
> >> environment, so I think we need to write our own makefiles to build it
> >> into a dll (Windows) or dylib (Mac). Currently it has Android.mk files
> >> [3]. I'm searching for a way to convert them to a makefile that we can
> >> run from the console. Please suggest a better approach if you have one.
> >>
> >> [1]
> >> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
> >> [2]
> >> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
> >> [3]
> >> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
> >>
> >>
> >> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <[email protected]> wrote:
> >>
> >>> This is a good start. However, there is no need for the Adder component,
> >>> "Extracted Text (OCR)" can just feed back into the PDFBox "Text
> >>> Extractor".
> >>>
> >>> Maybe show a "PDF" file feeding in to "Text Extractor", to make it clear
> >>> where the process starts.
> >>>
> >>> -- John
> >>>
> >>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <[email protected]>
> >>> wrote:
> >>>
> >>>> Sorry for the mistake. I added it to my Dropbox [1].
> >>>>
> >>>> [1]
> >>>>
> >>>
> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
> >>>>
> >>>> Thanks
> >>>> Dimuthu
> >>>>
> >>>>
> >>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <[email protected]>
> wrote:
> >>>>
> >>>>> I should add that the OCR engine should be pluggable so PDFToText might
> >>>>> use an interface, e.g. OCREngine and there will be a TesseractOCREngine
> >>>>> class somewhere which provides the required functionality and lives in a
> >>>>> separate jar file.
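
A rough sketch of what such a pluggable design could look like. Only the
names OCREngine and TesseractOCREngine come from the message above; the method
signature, the stand-in engine and the OcrTextExtractor class are assumptions
made up for illustration, and the real TesseractOCREngine would call into the
JNI binding instead of returning fixed text.

```java
import java.awt.image.BufferedImage;
import java.util.Objects;

/** Sketch of the pluggable engine interface (method signature is an assumption). */
interface OCREngine {
    /** Returns the text recognised in the given image region. */
    String recognize(BufferedImage image);
}

/** Stand-in engine for illustration; a TesseractOCREngine would live in its own jar. */
final class FixedTextOCREngine implements OCREngine {
    private final String text;

    FixedTextOCREngine(String text) {
        this.text = Objects.requireNonNull(text);
    }

    @Override
    public String recognize(BufferedImage image) {
        return text;
    }
}

/** Hypothetical caller: depends only on the interface, never on Tesseract itself. */
final class OcrTextExtractor {
    private final OCREngine engine;

    OcrTextExtractor(OCREngine engine) {
        this.engine = Objects.requireNonNull(engine);
    }

    String extract(BufferedImage region) {
        return engine.recognize(region).trim();
    }
}
```

Because the caller only sees OCREngine, the core code would not need a
compile-time dependency on the Tesseract jar, and another engine could be
swapped in later.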
> >>>>>
> >>>>> -- John
> >>>>>
> >>>>>> On 25 Feb 2014, at 20:18, Dimuthu <[email protected]>
> wrote:
> >>>>>>
> >>>>>> So do you need to embed those new functionalities into existing
> >>>>> PDFtoText algorithms or package them as a new sub system(something
> >>> like an
> >>>>> API)?
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: "John Hewson" <[email protected]>
> >>>>>> Sent: 26/02/2014 07:38
> >>>>>> To: "[email protected]" <[email protected]>
> >>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
> >>>>> Introduction
> >>>>>>
> >>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and page
> >>>>>> rotation.
> >>>>>>
> >>>>>> There is another use case for OCR: some fonts embedded in PDFs have
> >>>>>> corrupt encodings, which means the ASCII codes map to the wrong glyphs.
> >>>>>> We could OCR the glyphs to repair the encoding.
> >>>>>>
> >>>>>> -- John
> >>>>>>
> >>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
> >>> [email protected]>
> >>>>> wrote:
> >>>>>>>
> >>>>>>> Hi John,
> >>>>>>> Thanks for the explanation.
> >>>>>>> Let's say there is a PDF with both text in an extractable format and
> >>>>>>> some images with text (scanned images). In that case we first extract
> >>>>>>> the extractable content using PDFBox algorithms and the rest is
> >>>>>>> extracted using OCR. Finally we pack both results together and give
> >>>>>>> the output as PDFToText. Am I correct? What do you mean by "location
> >>>>>>> data"?
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> Dimuthu
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <[email protected]>
> >>>>> wrote:
> >>>>>>>>
> >>>>>>>>> 1. What is called "glyphs"?
> >>>>>>>>
> >>>>>>>> http://en.wikipedia.org/wiki/Glyph
> >>>>>>>>
> >>>>>>>>> 2. What is the main requirement of this project?
> >>>>>>>>> As far as I understood, first we need to generate an image of
> >>>>>>>>> malformed pdfs from PDFBox and then we need to do processing using
> >>>>>>>>> OCR for further accurate results. But the problem is, why shouldn't
> >>>>>>>>> we directly do OCR on those PDFs without getting output from PDFBox?
> >>>>>>>>> Correct me if I'm wrong.
> >>>>>>>>
> >>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
> >>>>>>>> (PDFToText). The goal of this project is to enhance PDFToText so that
> >>>>>>>> it can use OCR to extract text from areas of the document where the
> >>>>>>>> text is embedded as an image. Such PDF files are typically generated
> >>>>>>>> by scanners or fax machines. There is also another case where OCR is
> >>>>>>>> useful: some fonts embedded in PDF files contain the wrong encoding,
> >>>>>>>> so when text is extracted with PDFToText the result is nonsense but
> >>>>>>>> when drawn with PDFToImage we see the correct letters.
> >>>>>>>>
> >>>>>>>> Instead of:
> >>>>>>>> PDF => Image => OCR => Text
> >>>>>>>>
> >>>>>>>> We want to do:
> >>>>>>>> PDF => (Many images for words + location data => OCR) => Text
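
The second pipeline can be sketched roughly as follows. Everything here is an
assumption made up for illustration: the LocatedImage and LocatedOcr names,
the use of a plain Function as the OCR step, and a coordinate system where y
grows downward (page rotation is ignored). It only shows the idea of running
OCR on many small images and using their location data to put the results
back into reading order.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

/** Hypothetical holder: one small image plus where it sits on the page. */
final class LocatedImage {
    final BufferedImage image;
    final float x;
    final float y;

    LocatedImage(BufferedImage image, float x, float y) {
        this.image = image;
        this.x = x;
        this.y = y;
    }
}

/** Sketch: OCR each region, then join the results in reading order. */
final class LocatedOcr {

    static String extractText(List<LocatedImage> regions,
                              Function<BufferedImage, String> ocr) {
        List<LocatedImage> sorted = new ArrayList<>(regions);
        // Assumes y grows downward: sort top to bottom, then left to right.
        sorted.sort(Comparator
                .comparingDouble((LocatedImage r) -> r.y)
                .thenComparingDouble(r -> r.x));

        StringBuilder text = new StringBuilder();
        for (LocatedImage region : sorted) {
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(ocr.apply(region.image));
        }
        return text.toString();
    }
}
```

A real implementation would need something closer to the line and word
grouping that PDFTextStripper already does, since sorting on exact y values
breaks as soon as words on one line have slightly different baselines.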
> >>>>>>>>
> >>>>>>>> -- John
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
> >>>>>>>> [email protected]
> >>>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Ok fixed. This is what I did
> >>>>>>>>>> Right click on the new project ->Debug As-> Debug Configurations
> >>>>>>>> ->Source
> >>>>>>>>>> ->Add -> Project
> >>>>>>>>>> Then I selected PDFBox project.
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>> Dimuthu
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
> >>>>>>>>>> [email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I'm using Eclipse. This is what I want. I created a new Java
> >>>>>>>>>>> application project (say TestPDFBox) with a main class containing
> >>>>>>>>>>> the following code.
> >>>>>>>>>>>
> >>>>>>>>>>>     PDDocument document = new PDDocument();
> >>>>>>>>>>>     PDPage blankPage = new PDPage();
> >>>>>>>>>>>     document.addPage(blankPage);
> >>>>>>>>>>>     document.save("BlankPage.pdf");
> >>>>>>>>>>>     document.close();
> >>>>>>>>>>>
> >>>>>>>>>>> Then I need to add the jar files generated in the target folder of
> >>>>>>>>>>> PDFBox to the build path of my new project (I built the PDFBox
> >>>>>>>>>>> project from source). That is what I did. But let's say I need to
> >>>>>>>>>>> check the functionality of the document.save("") method. I don't
> >>>>>>>>>>> have a reference to its sources because I used the generated jars
> >>>>>>>>>>> directly. As Tilman said, I built PDFBox from sources, but I don't
> >>>>>>>>>>> know a proper way to use it in other projects other than adding
> >>>>>>>>>>> those jar files to the build path.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <
> [email protected]>
> >>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Which IDE are you using? You should be able to run the PDFToText
> >>>>>>>>>>>> class (in pdfbox-tools) using your IDE and pass a PDF file path as
> >>>>>>>>>>>> the command line argument.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -- John
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
> >>>>>>>> [email protected]>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi John,
> >>>>>>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and managed
> >>>>>>>>>>>>> to build it successfully. I looked at the classes you mentioned and
> >>>>>>>>>>>>> got a rough idea of how they work. To check them I added the jars
> >>>>>>>>>>>>> in the target folder to my separate Java project. I tried the
> >>>>>>>>>>>>> samples in http://pdfbox.apache.org/cookbook/. I need to look
> >>>>>>>>>>>>> further into the code, especially how those processXXX() methods
> >>>>>>>>>>>>> work in the PDFTextStripper class. What I usually do is add some
> >>>>>>>>>>>>> breakpoints and check them in the debug windows. But with jars
> >>>>>>>>>>>>> that's not possible. What approach do you follow for such a task?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I also installed Tesseract on my machine and managed to do some
> >>>>>>>>>>>>> OCR with it. It's a cool tool which works fine.
> >>>>>>>>>>>>> I'm still learning the code. If I run into any issue I'll drop you
> >>>>>>>>>>>>> a mail.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>> Dimuthu
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
> >>> [email protected]
> >>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Dimuthu
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/, it
> >>>>>>>>>>>>>> contains a basic overview of the project and details on how to
> >>>>>>>>>>>>>> obtain the source code and build PDFBox for yourself.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details the only
> >>>>>>>>>>>>>> thoughts so far regarding it. Note that the OCR libraries mentioned
> >>>>>>>>>>>>>> in the JIRA issue are all under the Apache license, which is a
> >>>>>>>>>>>>>> requirement.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Once you have the source code, take a look at the PageDrawer class
> >>>>>>>>>>>>>> to see how text and images are rendered. We want someone to
> >>>>>>>>>>>>>> interface at a low-level (e.g. one glyph, word, or sentence at a
> >>>>>>>>>>>>>> time) with an OCR engine. Also look at PDFTextStripper which is how
> >>>>>>>>>>>>>> text is currently extracted, take a look at how we have to go to
> >>>>>>>>>>>>>> great length to sort text back into reading order and infer the
> >>>>>>>>>>>>>> placement of diacritics - PDF is fundamentally a visual format, not
> >>>>>>>>>>>>>> a structured format like HTML - which is why extracting text can be
> >>>>>>>>>>>>>> so difficult sometimes.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The full PDF Reference document can be found at:
> >>>>>>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
> >>>>>>>> questions.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -- John
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
> >>>>>>>> [email protected]
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
> >>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka. I successfully completed GSoC
> >>>>>>>>>>>>>>> 2013 with the Apache ISIS [1] project. I'm very much interested in
> >>>>>>>>>>>>>>> OCR and image processing, so I would like to select this project
> >>>>>>>>>>>>>>> idea as my GSoC 2014 project because I feel it is the best suited
> >>>>>>>>>>>>>>> project for me. At university we have also done some research in
> >>>>>>>>>>>>>>> the OCR area and our group wrote a literature review about
> >>>>>>>>>>>>>>> increasing the efficiency of OCR systems (attached). Can you please
> >>>>>>>>>>>>>>> suggest where to start learning about PDFBox?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thank you
> >>>>>>>>>>>>>>> Dimuthu
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>> Regards
> >>>>>>>>>>>>>>> W.Dimuthu Upeksha
> >>>>>>>>>>>>>>> Undergraduate
> >>>>>>>>>>>>>>> Department of Computer Science And Engineering
> >>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka


-- 
Regards

W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka
