Re: [GSoC 2014]Optical Character Recognition project - Introduction

John Hewson Tue, 11 Mar 2014 12:12:12 -0700

Hi Dimuthu

The Tesseract wrapper needs to take its input from a BufferedImage rather than 
reading a file from disk, so instead of:


api.setImagePath("test.tif");

What we need is:

BufferedImage image = ImageIO.read(new File("test.tif"));
api.setImagePath(image);

This will let us use the BufferedImage generated by PDFRenderer 
without round-tripping to the disk.
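
As a rough illustration of doing the image-to-bytes step on the Java side, 
here is a minimal sketch. The class name ImageBytes, the method name 
toGrayBytes and the choice of 8-bit grayscale output are assumptions made for 
the example; they are not part of the existing wrapper.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of the Tesseract wrapper discussed here.
public class ImageBytes {
    /**
     * Converts an image to raw 8-bit grayscale pixels, one byte per pixel,
     * stored row by row (row stride equals the image width).
     */
    public static byte[] toGrayBytes(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        byte[] pixels = new byte[width * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y); // packed ARGB; alpha is ignored
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the common 0.299/0.587/0.114 luma weights.
                pixels[y * width + x] = (byte) ((r * 299 + g * 587 + b * 114) / 1000);
            }
        }
        return pixels;
    }
}
```

The resulting array, together with the image width and height, is the kind of 
data that could be handed to JNI. How the wrapper would actually accept it is 
still an open design choice.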

-- John

On 11 Mar 2014, at 11:13, DImuthu Upeksha <[email protected]> wrote:

> Hi John,
> Thanks for the guidance.
> I did a small analysis of the accuracy and performance of the new
> Tesseract wrapper. I used this [1] image as the input image and got
> the following data [2] after OCR. The first line is the recognised word,
> followed by location details (bounding box) of the word. I think these
> details are pretty much enough for our task. What remains now is
> converting the PDF file into an image, as you mentioned. I'm working on
> that these days.
> 
> [1] https://www.dropbox.com/s/11wahtonoz08zmn/image4.TIF
> [2] https://gist.github.com/DImuthuUpe/9491660
> 
> Thanks
> Dimuthu
> 
> On Mon, Mar 10, 2014 at 2:30 PM, John Hewson <[email protected]> wrote:
>> Dimuthu,
>> 
>>> I finished a basic implementation of the JNI wrapper for Tesseract. Now it can be
>>> built using Maven. Some useful methods that are needed to do basic OCR were
>>> implemented.
>> 
>> Great, it's looking good, nice and clean.
>> 
>>> 1. What is the task of processStream method in PDFTextStripper class line
>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>> page.findRotation() );
>> 
>> A PDF file is made up of pages, each of which contains a "content stream". 
>> This content stream contains a list of drawing commands such as "move to 
>> 10,15" or "write the word `foo`"; these are called operators. The 
>> processStream function reads the stream for the current page and executes 
>> each of the operators. Each operator is implemented in its own class, 
>> which is a subclass of PDFOperator. The constructor of 
>> PDFStreamEngine creates the operator classes using reflection, which is 
>> rather odd and I'm not sure why this design was chosen. The operators used 
>> by PDFTextStripper can be found in 
>> org/apache/pdfbox/resources/PDFTextStripper.properties
>> 
>>> 2. Say I need to extract images and their metadata from a PDF. What is the 
>>> best approach?
>> 
>> You could subclass PDFTextStripper and override the startDocument method and 
>> use it to create a PDFRenderer and store it in a field. Then override the 
>> processPage method and use the previously created PDFRenderer to render the 
>> current page to a buffered image and perform OCR on the image. Once you have 
>> the OCR text + positions, instead of calling processStream you can call 
>> processTextPosition once for each character + position.
>> 
>> The PDFRenderer class was just added to the trunk, so make sure you do an 
>> "svn update". Let me know if you need me to change PDFTextStripper to make 
>> it easier to subclass.
>> 
>> Cheers
>> 
>> -- John
>> 
>> On 9 Mar 2014, at 09:08, DImuthu Upeksha <[email protected]> wrote:
>> 
>>> Hi John,
>>> I finished a basic implementation of the JNI wrapper for Tesseract. Now it can be
>>> built using Maven. Some useful methods that are needed to do basic OCR were
>>> implemented.
>>> 
>>> I went through the PDFBox code several times and have a couple of questions
>>> that need to be clarified:
>>> 
>>> 1. What is the task of processStream method in PDFTextStripper class line
>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>> page.findRotation() );
>>> 
>>> 2. Say I need to extract images and their metadata from a PDF. What is the
>>> best approach?
>>> 
>>> Thanks
>>> Dimuthu
>>> 
>>> 
>>> On Fri, Mar 7, 2014 at 9:26 PM, DImuthu Upeksha
>>> <[email protected]> wrote:
>>> 
>>>> Hi John
>>>> I refactored the Tesseract JNI code to support a Maven build [1]. To create
>>>> the JNI library I added pre-built static libraries of Tesseract and Leptonica
>>>> to the resources folder [2]. For now it includes only libraries for Mac, but
>>>> we can easily add both Windows and Linux libraries. After "mvn clean
>>>> install", the jar is created under the target folder. Now all the setup is
>>>> done. What remains is implementing the native methods in tessbaseapi.cpp
>>>> [3]. I hope to finish it ASAP. Please let me know if you have any concerns
>>>> about the project structure.
>>>> 
>>>> [1] https://github.com/DImuthuUpe/Tesseract-API.git
>>>> [2] https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
>>>> [3] https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
>>>> 
>>>> Thanks
>>>> Dimuthu
>>>> 
>>>> 
>>>> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <[email protected]> wrote:
>>>> 
>>>>> Dimuthu
>>>>> 
>>>>>> There are a lot of code
>>>>>> fragments in the current Android JNI wrapper which use "(jint)somePointer"
>>>>>> casting, which will create terrible memory leaks in 64-bit environments
>>>>>> because pointers are 64 bits. So I believe writing it from the beginning
>>>>>> is much better.
>>>>> 
>>>>> That's a classic 64-bit pitfall, well spotted. We definitely need to
>>>>> support 64-bit JVMs.
>>>>> 
>>>>>> we can use
>>>>>> the static library of Leptonica (I did and it worked nicely). I think it is
>>>>>> not an issue to use its static library because both Tesseract and Leptonica
>>>>>> are under the Apache licence.
>>>>> 
>>>>> Sounds good, I found the following in the README:
>>>>> 
>>>>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
>>>>> without Leptonica.
>>>>> 
>>>>> Which makes sense.
>>>>> 
>>>>> -- John
>>>>> 
>>>>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <[email protected]> wrote:
>>>>> 
>>>>>> Hi John,
>>>>>> +1 for your suggestion about converting image <=> byte array on the Java
>>>>>> side. It reduces a lot of complexity. I don't know whether you have noticed,
>>>>>> but the jint data type in JNI is a 32-bit integer type. I noticed it on my
>>>>>> Mac but don't know about other operating systems.
>>>>>> 
>>>>>> Leptonica is the image processing library for Tesseract [1]. Tesseract
>>>>>> uses the image processing algorithms in Leptonica to implement its OCR
>>>>>> algorithms. This [2] is the .cpp file responsible for creating the Tesseract
>>>>>> API. You can see it includes the allheaders.h header file, which is the main
>>>>>> header file of Leptonica. So I think it is a must to build Leptonica first and
>>>>>> link it when we build Tesseract. This is not a big problem if we can use
>>>>>> the static library of Leptonica (I did and it worked nicely). I think it is
>>>>>> not an issue to use its static library because both Tesseract and Leptonica
>>>>>> are under the Apache licence.
>>>>>> 
>>>>>> I'm working on the maven implementation you have mentioned and will get
>>>>>> back to you soon.
>>>>>> 
>>>>>> Thanks
>>>>>> Dimuthu
>>>>>> 
>>>>>> 
>>>>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>>>>>> [2] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>>>>>> 
>>>>>> 
>>>>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <[email protected]> wrote:
>>>>>> 
>>>>>>> Hi Dimuthu,
>>>>>>> 
>>>>>>> 1,2,3:
>>>>>>> 
>>>>>>> Feel free to write your own Tesseract binding or port the existing code as
>>>>>>> you see fit. The JNI binding should be minimal, only the methods you require
>>>>>>> need to be wrapped. Also, don't forget that some of the interop can be done
>>>>>>> in Java, for example if it is easier to convert a BufferedImage to a byte
>>>>>>> array in Java then do it there and pass the result to JNI rather than
>>>>>>> writing lots of JNI C++ to achieve the same result.
>>>>>>> 
>>>>>>> Your GitHub repo looks like a good start, I can make comments there as
>>>>>>> things progress.
>>>>>>> 
>>>>>>> Is it possible to build Tesseract without leptonica? I was under the
>>>>>>> impression that it was used for image i/o only, but I may be misinformed.
>>>>>>> 
>>>>>>> 4: The native platform library should be built as part of the Maven build
>>>>>>> for the Tesseract wrapper which can be a separate project. The output can be
>>>>>>> a jar file which contains the native binaries. It should be possible for the
>>>>>>> jar to contain prebuilt binaries for all platforms but this is something we
>>>>>>> can worry about later. Right now the goal should be to build a jar
>>>>>>> containing just the current platform's native binary and any Java wrapper
>>>>>>> code.
>>>>>>> 
>>>>>>> -- John
>>>>>>> 
>>>>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <[email protected]> wrote:
>>>>>>> 
>>>>>>>> Hi John,
>>>>>>>> 
>>>>>>>> I tried to reuse that Android JNI wrapper for Tesseract. Here are my
>>>>>>>> observations:
>>>>>>>> 
>>>>>>>> 1. This wrapper depends heavily on Android image libraries
>>>>>>>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>>>>>>>> 
>>>>>>>> 2. But I can understand the underlying logic in each function. Basically
>>>>>>>> what it does is map the Tesseract API functions [2] to Java methods.
>>>>>>>> In between it does some image <=> byte array conversions using
>>>>>>>> the bitmap libraries in Android.
>>>>>>>> 
>>>>>>>> 3. There are two ways. 1: We can port its code to make it compatible with
>>>>>>>> our environments (Linux, Windows and Mac), which is really painful. It will
>>>>>>>> also cause memory leaks. 2: We can use only its function signatures and
>>>>>>>> implement them using our own code.
>>>>>>>> 
>>>>>>>> I think the 2nd solution is better because we need only a few operations to
>>>>>>>> be done using the Tesseract library. I have created a GitHub repo [3] for
>>>>>>>> this. It's still not finished. I need to add some make files and build
>>>>>>>> files to make it run properly. I also need to implement those wrapper
>>>>>>>> functions [4]. This may take some time.
>>>>>>>> 
>>>>>>>> 4. Because we are calling native libraries we need different builds of the
>>>>>>>> Tesseract and Leptonica libraries for each platform (dll for Windows, so
>>>>>>>> for Linux, dylib for Mac). So we may need to build those libraries at the
>>>>>>>> time we build the PDFBox project. Or we can pre-build those libraries and
>>>>>>>> add them to the project in .dll, .so or .dylib format. Which is the
>>>>>>>> preferred way?
>>>>>>>> 
>>>>>>>> [1] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>>>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>>>>>>> [4] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Dimuthu
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> I made the necessary changes to the document [1].
>>>>>>>>> 
>>>>>>>>> For the last two days I had a deep look at this [2] JNI wrapper for the
>>>>>>>>> Tesseract API. Unfortunately it has been designed for the Android
>>>>>>>>> environment, so I think we need to write our own make files to build it
>>>>>>>>> into a dll (Windows) or dylib (Mac). Currently it has Android.mk files [3].
>>>>>>>>> I'm searching for a way to convert them to a make file that we can run on
>>>>>>>>> the console. Please suggest a better approach if you have one.
>>>>>>>>> 
>>>>>>>>> [1] https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>>>>>>>>> [2] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>>>>>>>>> [3] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>> This is a good start. However, there is no need for the Adder component,
>>>>>>>>>> "Extracted Text (OCR)" can just feed back into the PDFBox "Text Extractor".
>>>>>>>>>> 
>>>>>>>>>> Maybe show a "PDF" file feeding in to "Text Extractor", to make it clear
>>>>>>>>>> where the process starts.
>>>>>>>>>> 
>>>>>>>>>> -- John
>>>>>>>>>> 
>>>>>>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>>>>>>>>> 
>>>>>>>>>>> [1] https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>>>>>>>>> 
>>>>>>>>>>> Thanks
>>>>>>>>>>> Dimuthu
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> I should add that the OCR engine should be pluggable so PDFToText might
>>>>>>>>>>>> use an interface, e.g. OCREngine and there will be a TesseractOCREngine
>>>>>>>>>>>> class somewhere which provides the required functionality and lives in a
>>>>>>>>>>>> separate jar file.
>>>>>>>>>>>> 
>>>>>>>>>>>> -- John
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <[email protected]> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> So do you need to embed those new functionalities into the existing
>>>>>>>>>>>>> PDFToText algorithms or package them as a new subsystem (something
>>>>>>>>>>>>> like an API)?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: "John Hewson" <[email protected]>
>>>>>>>>>>>>> Sent: 26/02/2014 07:38
>>>>>>>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>>>>>>>>>>> Introduction
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and page
>>>>>>>>>>>>> rotation.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs have
>>>>>>>>>>>>> corrupt encodings, which means the ASCII codes map to the wrong
>>>>>>>>>>>>> glyphs. We could OCR the glyphs to repair the encoding.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -- John
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>>>> Let's say there is a PDF with both text in an extractable format and some
>>>>>>>>>>>>>> images with text (scanned images). In that case we first extract the
>>>>>>>>>>>>>> extractable content using PDFBox algorithms and the rest is extracted
>>>>>>>>>>>>>> using OCR. Finally we pack both results together and give the output as
>>>>>>>>>>>>>> PDFToText. Am I correct? What do you mean by "location data"?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <[email protected]> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 1. What is called "glyphs" ?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>>>>>>>>>> malformed pdfs from PDFBox and then we need to do processing using
>>>>>>>>>>>>>>>> OCR for more accurate results. But the problem is, why shouldn't
>>>>>>>>>>>>>>>> we directly do OCR on those PDFs without getting output from PDFBox?
>>>>>>>>>>>>>>>> Correct me if I'm wrong.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text (PDFToText).
>>>>>>>>>>>>>>> The goal of this project is to enhance PDFToText so that it can use OCR to
>>>>>>>>>>>>>>> extract text from areas of the document where the text is embedded as an
>>>>>>>>>>>>>>> image. Such PDF files are typically generated by scanners or fax machines.
>>>>>>>>>>>>>>> There is also another case where OCR is useful: some fonts embedded in PDF
>>>>>>>>>>>>>>> files contain the wrong encoding, so when text is extracted with PDFToText
>>>>>>>>>>>>>>> the result is nonsense but when drawn with PDFToImage we see the correct
>>>>>>>>>>>>>>> letters.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Instead of:
>>>>>>>>>>>>>>> PDF => Image => OCR => Text
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We want to do:
>>>>>>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Ok, fixed. This is what I did:
>>>>>>>>>>>>>>>>> Right click on the new project -> Debug As -> Debug Configurations
>>>>>>>>>>>>>>>>> -> Source -> Add -> Project
>>>>>>>>>>>>>>>>> Then I selected the PDFBox project.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I'm using Eclipse. This is what I want. I created a new Java application
>>>>>>>>>>>>>>>>>> project (say TestPDFBox) with a main class with the following code.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>>>>>>>>>> document.addPage( blankPage );
>>>>>>>>>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>>>>>>>>>> document.close();
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Then I need to add the jar files generated in the target folder of PDFBox
>>>>>>>>>>>>>>>>>> to the build path of my new project (I did build the PDFBox project from
>>>>>>>>>>>>>>>>>> source). That is what I did. But let's say I need to check the
>>>>>>>>>>>>>>>>>> functionality of the document.save("") method. I don't have a reference to
>>>>>>>>>>>>>>>>>> its sources because I directly used the generated jars. As Tilman said, I
>>>>>>>>>>>>>>>>>> built PDFBox from sources, but I don't know a proper way to use it in other
>>>>>>>>>>>>>>>>>> projects other than adding those jar files to the build path.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <[email protected]> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the PDFToText class
>>>>>>>>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as the command
>>>>>>>>>>>>>>>>>>> line argument.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and managed to
>>>>>>>>>>>>>>>>>>>> build it successfully. I looked at the classes you mentioned and I got a
>>>>>>>>>>>>>>>>>>>> rough idea about how they work. To check them I added the jars in the
>>>>>>>>>>>>>>>>>>>> target folder to my separate Java project. I tried the samples in
>>>>>>>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to look further into the code,
>>>>>>>>>>>>>>>>>>>> especially how those processXXX() methods work in the PDFTextStripper
>>>>>>>>>>>>>>>>>>>> class. What I usually do is add some breakpoints and check them in the
>>>>>>>>>>>>>>>>>>>> debug windows. But using jars it's not possible. What approach do you
>>>>>>>>>>>>>>>>>>>> follow in order to do such a task?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I also installed Tesseract on my machine and managed to do some OCR
>>>>>>>>>>>>>>>>>>>> stuff. That's a cool tool which works fine.
>>>>>>>>>>>>>>>>>>>> I'm still learning the code. If I get any issue I'll drop you a mail.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/. It contains
>>>>>>>>>>>>>>>>>>>>> a basic overview of the project and details on how to obtain the source
>>>>>>>>>>>>>>>>>>>>> code and build PDFBox for yourself.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details the only
>>>>>>>>>>>>>>>>>>>>> thoughts so far regarding it. Note that the OCR libraries mentioned in
>>>>>>>>>>>>>>>>>>>>> the JIRA issue are all under the Apache license, which is a requirement.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Once you have the source code, take a look at the PageDrawer class to see
>>>>>>>>>>>>>>>>>>>>> how text and images are rendered. We want someone to interface at a low
>>>>>>>>>>>>>>>>>>>>> level (e.g. one glyph, word, or sentence at a time) with an OCR engine.
>>>>>>>>>>>>>>>>>>>>> Also look at PDFTextStripper, which is how text is currently extracted,
>>>>>>>>>>>>>>>>>>>>> and take a look at how we have to go to great lengths to sort text back
>>>>>>>>>>>>>>>>>>>>> into reading order and infer the placement of diacritics. PDF is
>>>>>>>>>>>>>>>>>>>>> fundamentally a visual format, not a structured format like HTML, which
>>>>>>>>>>>>>>>>>>>>> is why extracting text can be so difficult sometimes.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any questions.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
>>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka. I successfully completed GSoC 2013
>>>>>>>>>>>>>>>>>>>>>> with the Apache ISIS [1] project. I'm very much interested in OCR and
>>>>>>>>>>>>>>>>>>>>>> image processing, so I would like to select this project idea as my
>>>>>>>>>>>>>>>>>>>>>> GSoC 2014 project because I feel it is the best suited project for me.
>>>>>>>>>>>>>>>>>>>>>> At university we have also done some research in the OCR area and our
>>>>>>>>>>>>>>>>>>>>>> group wrote a literature review about increasing the efficiency of OCR
>>>>>>>>>>>>>>>>>>>>>> systems (attached). Can you please suggest where to start learning
>>>>>>>>>>>>>>>>>>>>>> about PDFBox?
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> [1] http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
