Re: [GSoC 2014]Optical Character Recognition project - Introduction

DImuthu Upeksha Mon, 03 Mar 2014 16:53:33 -0800

Hi John,
I only noticed your last reply after sending my previous mail. Sorry
about that. I'm also on a Mac, and I use VMs to test the other
platforms. I have a lot of experience with Maven. I'll go through the
plugin and try to apply it to that GitHub project.


Thanks
Dimuthu


On Tue, Mar 4, 2014 at 6:11 AM, DImuthu Upeksha
<[email protected]> wrote:

> Hi John,
>
> I tried to reuse that Android JNI wrapper for Tesseract. Here are my
> observations:
>
> 1. This wrapper depends heavily on Android image libraries
> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>
> 2. But I can understand the underlying logic of each function. Basically,
> it maps Tesseract API functions [2] to Java methods. In between, it does
> some image <=> byte array conversions using the Android bitmap
> libraries.
>
> 3. There are two options. 1: We can port its code to make it compatible
> with our environments (Linux, Windows and Mac), which is really painful.
> It will also cause memory leaks. 2: We can keep only its function
> signatures and implement them with our own code.
>
> I think the 2nd solution is better because we need only a few operations
> from the Tesseract library. I have created a GitHub repo [3] for this.
> It's still not finished. I need to add some makefiles and build files to
> make it run properly. I also need to implement those wrapper functions
> [4]. This may take some time.
>
> 4. Because we are calling native libraries, we need different builds of
> the Tesseract and Leptonica libraries for each platform (dll for Windows,
> so for Linux, dylib for Mac). So we may need to build those libraries when
> we build the PDFBox project. Or we can pre-build those libraries and add
> them to the project in .dll, .so or .dylib format. Which way is
> preferred?
>
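A minimal sketch of the per-platform naming issue raised in point 4,
assuming the native library is shipped under a hypothetical base name such
as "tesseract". The class and method names are illustrative only and are
not part of PDFBox or Tesseract; the actual file names of a real Tesseract
build may differ.

```java
import java.util.Locale;

class NativeLibraryNames {
    /**
     * Maps an "os.name" value to the file name a native library with the
     * given base name would typically have on that platform.
     * The base name is an assumption made for illustration.
     */
    static String fileNameFor(String osName, String baseName) {
        String os = osName.toLowerCase(Locale.ROOT);
        // Check Mac first: "darwin" also contains the substring "win".
        if (os.contains("mac") || os.contains("darwin")) {
            return "lib" + baseName + ".dylib";
        } else if (os.contains("win")) {
            return baseName + ".dll";
        } else {
            // Treat everything else as a Linux-style platform.
            return "lib" + baseName + ".so";
        }
    }

    public static void main(String[] args) {
        String osName = System.getProperty("os.name");
        System.out.println(fileNameFor(osName, "tesseract"));
    }
}
```

The JDK's own System.mapLibraryName(String) does a similar mapping for the
platform the JVM is running on, so a real build would probably rely on that
rather than on a hand-written table.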
> [1]
> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
> [3] https://github.com/DImuthuUpe/Tesseract-API
> [4]
> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>
> Thanks
> Dimuthu
>
>
> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
> [email protected]> wrote:
>
>> I made the necessary changes to the document [1].
>>
>> For the last two days I had a deep look at this [2] JNI wrapper for the
>> Tesseract API.
>> Unfortunately it has been designed for the Android environment, so I think
>> we need to write our own makefiles to build it into a dll (Windows) or
>> dylib (Mac). Currently it has Android.mk files [3]. I'm searching for a
>> way to convert them to a makefile that we can run from the console. Please
>> suggest a better approach if you have one.
>>
>> [1]
>> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>> [2]
>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>> [3]
>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>
>>
>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <[email protected]> wrote:
>>
>>> This is a good start. However, there is no need for the Adder component;
>>> "Extracted Text (OCR)" can just feed back into the PDFBox "Text Extractor".
>>>
>>> Maybe show a "PDF" file feeding into "Text Extractor", to make it clear
>>> where the process starts.
>>>
>>> -- John
>>>
>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <[email protected]>
>>> wrote:
>>>
>>> > Sorry for the mistake. I added it to my Dropbox [1].
>>> >
>>> > [1]
>>> >
>>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>> >
>>> > Thanks
>>> > Dimuthu
>>> >
>>> >
>>> > On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <[email protected]>
>>> wrote:
>>> >
>>> >> I should add that the OCR engine should be pluggable so PDFToText
>>> might
>>> >> use an interface, e.g. OCREngine and there will be a
>>> TesseractOCREngine
>>> >> class somewhere which provides the required functionality and lives
>>> in a
>>> >> separate jar file.
>>> >>
>>> >> -- John
>>> >>
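A minimal sketch of the pluggable design described above. Only the names
OCREngine and TesseractOCREngine come from the thread; the recognize()
method, its BufferedImage parameter, and the two helper classes are
assumptions made for illustration. A stub engine stands in for Tesseract so
that the sketch runs without any native library.

```java
import java.awt.image.BufferedImage;

/** Hypothetical engine interface; the method signature is an assumption. */
interface OCREngine {
    /** Returns the text recognized in the given image region. */
    String recognize(BufferedImage image);
}

/**
 * Stand-in engine that only reports the image size. A TesseractOCREngine
 * living in a separate jar would implement the same interface.
 */
class StubOCREngine implements OCREngine {
    @Override
    public String recognize(BufferedImage image) {
        return "[" + image.getWidth() + "x" + image.getHeight() + " image]";
    }
}

/** Hypothetical caller: it depends on the interface, not on Tesseract. */
class TextExtractorSketch {
    private final OCREngine engine;

    TextExtractorSketch(OCREngine engine) {
        this.engine = engine;
    }

    String extract(BufferedImage region) {
        return engine.recognize(region);
    }
}
```

Because the caller only sees OCREngine, swapping the stub for a real engine
would not require any change to the extraction code.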
>>> >>> On 25 Feb 2014, at 20:18, Dimuthu <[email protected]>
>>> wrote:
>>> >>>
>>> >>> So do you need to embed those new functionalities into existing
>>> >> PDFtoText algorithms or package them as a new sub system(something
>>> like an
>>> >> API)?
>>> >>>
>>> >>> -----Original Message-----
>>> >>> From: "John Hewson" <[email protected]>
>>> >>> Sent: 26/02/2014 07:38
>>> >>> To: "[email protected]" <[email protected]>
>>> >>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>> >> Introduction
>>> >>>
>>> >>> Yes, exactly. By location data I just mean (x,y) coordinates and page
>>> >> rotation.
>>> >>>
>>> >>> There is another use case for OCR: some fonts embedded in PDFs have
>>> >> corrupt encodings, which means the ASCII codes map to the wrong
>>> glyphs. We
>>> >> could OCR the glyphs to repair the encoding.
>>> >>>
>>> >>> -- John
>>> >>>
>>> >>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>>> [email protected]>
>>> >> wrote:
>>> >>>>
>>> >>>> Hi John,
>>> >>>> Thanks for the explanation.
>>> >>>> Let's say there is a pdf with both text in extractable format and
>>> some
>>> >>>> images with text(Scanned images). In that case first we extract
>>> those
>>> >>>> extractable content using PDFBox algorithms and rest is extracted
>>> using
>>> >>>> OCR. Finally we pack both results together and give output as
>>> >> PDFToText. Am
>>> >>>> I correct? What do you mean by "location data"?
>>> >>>>
>>> >>>> Thanks
>>> >>>> Dimuthu
>>> >>>>
>>> >>>>
>>> >>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <[email protected]>
>>> >> wrote:
>>> >>>>>
>>> >>>>> 1. What is called "glyphs" ?
>>> >>>>>
>>> >>>>> http://en.wikipedia.org/wiki/Glyph
>>> >>>>>
>>> >>>>>> 2. What is the main requirement of this project?
>>> >>>>>> As far as I understood, first we need to generate an image of
>>> >>>>>> malformed pdfs from
>>> >>>>>> PDFBox and then we need to do processing using OCR for further
>>> >> accurate
>>> >>>>>> results.  But the problem is, why shouldn't we directly do OCR on
>>> >> those
>>> >>>>>> PDFs without getting output from PDFBox? Correct me if I'm wrong.
>>> >>>>>
>>> >>>>> PDFBox can generate images (PDFToImage) and can extract text
>>> >> (PDFToText).
>>> >>>>> The goal of
>>> >>>>> this project is to enhance PDFToText so that it can use OCR to
>>> extract
>>> >>>>> text from areas of the
>>> >>>>> document where the text is embedded as an image. Such PDF files are
>>> >>>>> typically generated by
>>> >>>>> scanners or fax machines. There is also another case where OCR is
>>> >> useful:
>>> >>>>> some fonts embedded
>>> >>>>> in PDF files contain the wrong encoding, so when text is extracted
>>> with
>>> >>>>> PDFToText the result is
>>> >>>>> nonsense but when drawn with PDFToImage we see the correct letters.
>>> >>>>>
>>> >>>>> Instead of:
>>> >>>>> PDF => Image => OCR => Text
>>> >>>>>
>>> >>>>> We want to do:
>>> >>>>> PDF => (Many images for words + location data => OCR) => Text
>>> >>>>>
>>> >>>>> -- John
>>> >>>>>
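The second pipeline above can be sketched roughly as follows. This is only
an illustration under stated assumptions: the Region class, its fields, and
the extract() method are hypothetical, the OCR step is passed in as a plain
function, and regions are ordered top to bottom and then left to right on
the assumption that y grows downward (image-style coordinates). Page
rotation is ignored here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

class RegionPipelineSketch {
    /** A cropped image region plus its location data (hypothetical type). */
    static final class Region {
        final String imageId; // placeholder for the actual pixel data
        final float x;
        final float y;

        Region(String imageId, float x, float y) {
            this.imageId = imageId;
            this.x = x;
            this.y = y;
        }
    }

    /** Runs OCR per region and joins the results in position order. */
    static String extract(List<Region> regions, Function<Region, String> ocr) {
        List<Region> ordered = new ArrayList<>(regions);
        // Sort by line (y) first, then by horizontal position (x).
        ordered.sort(Comparator.comparingDouble((Region r) -> r.y)
                .thenComparingDouble(r -> r.x));
        StringBuilder text = new StringBuilder();
        for (Region r : ordered) {
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(ocr.apply(r));
        }
        return text.toString();
    }
}
```

The point of keeping the location data next to each small image is that the
recognized words can be placed back among the text extracted by other means,
rather than coming back as one unpositioned block for the whole page.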
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>>> >>>>> [email protected]
>>> >>>>>>> wrote:
>>> >>>>>>
>>> >>>>>>> Ok fixed. This is what I did
>>> >>>>>>> Right click on the new project ->Debug As-> Debug Configurations
>>> >>>>> ->Source
>>> >>>>>>> ->Add -> Project
>>> >>>>>>> Then I selected PDFBox project.
>>> >>>>>>>
>>> >>>>>>> Thanks
>>> >>>>>>> Dimuthu
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>> >>>>>>> [email protected]> wrote:
>>> >>>>>>>
>>> >>>>>>>> I'm using eclipse. This is what I want. I created a new Java
>>> >>>>> application
>>> >>>>>>>> project (say TestPDFBox) with a main class with following code.
>>> >>>>>>>>
>>> >>>>>>>> PDDocument document = new PDDocument();
>>> >>>>>>>> PDPage blankPage = new PDPage();
>>> >>>>>>>> document.addPage(blankPage);
>>> >>>>>>>> document.save("BlankPage.pdf");
>>> >>>>>>>> document.close();
>>> >>>>>>>>
>>> >>>>>>>> Then I added the jar files generated in the target folder of PDFBox
>>> >>>>>>>> to the build path of my new project (I built the PDFBox project
>>> >>>>>>>> from source). But let's say I need to check the functionality of
>>> >>>>>>>> the document.save("") method. I don't have a reference to its
>>> >>>>>>>> sources because I used the generated jars directly. As Tilman said,
>>> >>>>>>>> I built PDFBox from sources, but I don't know a proper way to use
>>> >>>>>>>> it in other projects other than adding those jar files to the
>>> >>>>>>>> build path.
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <[email protected]
>>> >
>>> >>>>> wrote:
>>> >>>>>>>>
>>> >>>>>>>>> Which IDE are you using? You should be able to run the
>>> PDFToText
>>> >> class
>>> >>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as
>>> the
>>> >>>>> command
>>> >>>>>>>>> line argument.
>>> >>>>>>>>>
>>> >>>>>>>>> -- John
>>> >>>>>>>>>
>>> >>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>>> >>>>> [email protected]>
>>> >>>>>>>>> wrote:
>>> >>>>>>>>>>
>>> >>>>>>>>>> Hi John,
>>> >>>>>>>>>> Thanks for the reply. Yes I checked out PDFBox code and
>>> managed to
>>> >>>>>>>>> build
>>> >>>>>>>>>> code successfully. I looked at the classes you mentioned and I
>>> >> got a
>>> >>>>>>>>> rough
>>> >>>>>>>>>> idea about how they are working. To check them I used the
>>> jars in
>>> >>>>>>>>> target
>>> >>>>>>>>>> folder to my separate java project. I tried samples in
>>> >>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to further look
>>> into
>>> >> code
>>> >>>>>>>>>> specially how those processXXX() methods work in
>>> PDFTextStripper
>>> >>>>> class.
>>> >>>>>>>>>> What I usually do is adding some breakpoints and checking
>>> them in
>>> >>>>> debug
>>> >>>>>>>>>> windows. But using jars it's not possible. What is the way you
>>> >> follow
>>> >>>>>>>>> in
>>> >>>>>>>>>> order to do such task?
>>> >>>>>>>>>>
>>> >>>>>>>>>> As well I installed tesseract in to my machine and managed to
>>> do
>>> >> some
>>> >>>>>>>>> OCR
>>> >>>>>>>>>> stuff also. That's a cool tool which works fine.
>>> >>>>>>>>>> I'm still learning the code. If I get any issue I'll drop you
>>> a
>>> >> mail.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Thanks
>>> >>>>>>>>>> Dimuthu
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>>> [email protected]
>>> >>>
>>> >>>>>>>>> wrote:
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Hi Dimuthu
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/ and it
>>> >>>>>>>>> contains
>>> >>>>>>>>>>> a basic overview of the project
>>> >>>>>>>>>>> and details on how to obtain the source code and build
>>> PDFBox for
>>> >>>>>>>>> yourself.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details
>>> the
>>> >> only
>>> >>>>>>>>>>> thoughts so far regarding it.
>>> >>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue are
>>> all
>>> >>>>> under
>>> >>>>>>>>> the
>>> >>>>>>>>>>> Apache license, which is a
>>> >>>>>>>>>>> requirement.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Once you have the source code, take a look at the PageDrawer
>>> >> class
>>> >>>>> to
>>> >>>>>>>>> see
>>> >>>>>>>>>>> how text and images are
>>> >>>>>>>>>>> rendered. We want someone to interface at a low-level (e.g.
>>> one
>>> >>>>> glyph,
>>> >>>>>>>>>>> word, or sentence at a time) with
>>> >>>>>>>>>>> an OCR engine. Also look at PDFTextStripper which is how
>>> text is
>>> >>>>>>>>> currently
>>> >>>>>>>>>>> extracted, take a look at how
>>> >>>>>>>>>>> we have to go to great length to sort text back into reading
>>> >> order
>>> >>>>> and
>>> >>>>>>>>>>> infer the placement of diacritics - PDF
>>> >>>>>>>>>>> is fundamentally a visual format, not a structured format
>>> like
>>> >> HTML
>>> >>>>> -
>>> >>>>>>>>>>> which is why extracting text can be so
>>> >>>>>>>>>>> difficult sometimes.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> The full PDF Reference document can be found at:
>>> >>>>>
>>> >>
>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
>>> >>>>> questions.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Thanks,
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> -- John
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>>> >>>>> [email protected]
>>> >>>>>>>>>>
>>> >>>>>>>>>>> wrote:
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>> Hi,
>>> >>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering Undergraduate
>>> at
>>> >>>>>>>>> University
>>> >>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC 2013
>>> with
>>> >>>>>>>>> Apache
>>> >>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and image
>>> >>>>> processing
>>> >>>>>>>>>>> stuff. So I would like to select this project idea as my GSoC
>>> >> 2014
>>> >>>>>>>>> project
>>> >>>>>>>>>>> because I feel like it is the best suited project for me. In
>>> >>>>>>>>> university
>>> >>>>>>>>>>> also we have done some research in OCR area and our group
>>> wrote a
>>> >>>>>>>>>>> literature review about increasing efficiency of OCR
>>> >>>>>>>>> systems(attached). Can
>>> >>>>>>>>>>> you please suggest me where to start learning about PDFBox?
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> [1]
>>> >>>>>
>>> >>
>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Thank you
>>> >>>>>>>>>>>> Dimuthu
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> --
>>> >>>>>>>>>>>> Regards
>>> >>>>>>>>>>>> W.Dimuthu Upeksha
>>> >>>>>>>>>>>> Undergraduate
>>> >>>>>>>>>>>> Department of Computer Science And Engineering
>>> >>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>> --
>>> >>>>>>>>>> Regards
>>> >>>>>>>>>>
>>> >>>>>>>>>> W.Dimuthu Upeksha
>>> >>>>>>>>>> Undergraduate
>>> >>>>>>>>>> Department of Computer Science And Engineering
>>> >>>>>>>>>>
>>> >>>>>>>>>> University of Moratuwa, Sri Lanka
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> --
>>> >>>>>>>> Regards
>>> >>>>>>>>
>>> >>>>>>>> W.Dimuthu Upeksha
>>> >>>>>>>> Undergraduate
>>> >>>>>>>> Department of Computer Science And Engineering
>>> >>>>>>>>
>>> >>>>>>>> University of Moratuwa, Sri Lanka
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> --
>>> >>>>>>> Regards
>>> >>>>>>>
>>> >>>>>>> W.Dimuthu Upeksha
>>> >>>>>>> Undergraduate
>>> >>>>>>> Department of Computer Science And Engineering
>>> >>>>>>>
>>> >>>>>>> University of Moratuwa, Sri Lanka
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> --
>>> >>>>>> Regards
>>> >>>>>>
>>> >>>>>> W.Dimuthu Upeksha
>>> >>>>>> Undergraduate
>>> >>>>>> Department of Computer Science And Engineering
>>> >>>>>>
>>> >>>>>> University of Moratuwa, Sri Lanka
>>> >>>>
>>> >>>>
>>> >>>> --
>>> >>>> Regards
>>> >>>>
>>> >>>> W.Dimuthu Upeksha
>>> >>>> Undergraduate
>>> >>>> Department of Computer Science And Engineering
>>> >>>>
>>> >>>> University of Moratuwa, Sri Lanka
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Regards
>>> >
>>> > W.Dimuthu Upeksha
>>> > Undergraduate
>>> > Department of Computer Science And Engineering
>>> >
>>> > University of Moratuwa, Sri Lanka
>>>
>>>
>>
>>
>> --
>> Regards
>>
>> W.Dimuthu Upeksha
>> Undergraduate
>> Department of Computer Science And Engineering
>>
>> University of Moratuwa, Sri Lanka
>>
>
>
>
> --
> Regards
>
> W.Dimuthu Upeksha
> Undergraduate
> Department of Computer Science And Engineering
>
> University of Moratuwa, Sri Lanka
>



-- 
Regards

W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka
