Re: [GSoC 2014]Optical Character Recognition project - Introduction

John Hewson Thu, 13 Mar 2014 11:41:09 -0700

Thanks, I saw your new refactoring too, it’s good. Now the following methods 
are no longer needed:


public void setImagePath(String path)
public void setImage(byte[] imagedata, int width, int height, int bpp, int bpl)

Cheers

-- John

On 11 Mar 2014, at 22:58, DImuthu Upeksha <[email protected]> wrote:

> Hi John,
> Yes. I implemented a new method [1] that accepts the image as a byte
> stream. We can't send BufferedImage objects directly to the native
> side, so I convert the buffered image into a byte array and pass that
> instead. The native side then converts it back into a compatible
> format. Along with the bytes we need to pass some metadata about the
> byte stream: image width, height, bytes per pixel and bytes per row.
> I checked it with this test case [2] and it works fine.
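
A minimal Java-side sketch of the conversion described above, assuming the native side accepts tightly packed 24-bit RGB data. The class and field names are hypothetical and are not taken from the Tesseract-API repository; only the four metadata values (width, height, bytes per pixel, bytes per row) come from the message.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper; names are illustrative only.
final class RawImageBytes {
    final byte[] data;
    final int width;
    final int height;
    final int bytesPerPixel;
    final int bytesPerLine;

    private RawImageBytes(byte[] data, int width, int height,
                          int bytesPerPixel, int bytesPerLine) {
        this.data = data;
        this.width = width;
        this.height = height;
        this.bytesPerPixel = bytesPerPixel;
        this.bytesPerLine = bytesPerLine;
    }

    /** Copies the pixels into a packed array of R, G, B bytes, row by row. */
    static RawImageBytes fromBufferedImage(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        int bytesPerPixel = 3;                      // assumption: 24-bit RGB, no alpha
        int bytesPerLine = width * bytesPerPixel;   // assumption: no row padding
        byte[] data = new byte[bytesPerLine * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y);       // packed as 0xAARRGGBB
                int offset = y * bytesPerLine + x * bytesPerPixel;
                data[offset] = (byte) (rgb >> 16);  // red
                data[offset + 1] = (byte) (rgb >> 8); // green
                data[offset + 2] = (byte) rgb;      // blue
            }
        }
        return new RawImageBytes(data, width, height, bytesPerPixel, bytesPerLine);
    }
}
```

Going through getRGB is slower than reading the raster directly, but it works for any BufferedImage type, which keeps the JNI side free of per-format handling.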
> 
> [1] 
> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/java/com/apache/pdfbox/ocr/tesseract/TessBaseAPI.java#L74
> [2] 
> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/test/java/com/apache/pdfbox/ocr/tesseract/TessByteSteamTest.java
> 
> Thanks
> Dimuthu
> 
> On Wed, Mar 12, 2014 at 12:40 AM, John Hewson <[email protected]> wrote:
>> Hi Dimuthu
>> 
>> The Tesseract wrapper needs to take its input from a BufferedImage rather 
>> than reading a file from disk, so instead of:
>> 
>> api.setImagePath("test.tif");
>> 
>> What we need is:
>> 
>> BufferedImage image = ImageIO.read(new File("test.tif"));
>> api.setImage(image);
>> 
>> Because this will let us use the BufferedImage generated by PDFRenderer 
>> without round-tripping to the disk.
>> 
>> -- John
>> 
>> On 11 Mar 2014, at 11:13, DImuthu Upeksha <[email protected]> wrote:
>> 
>>> Hi John,
>>> Thanks for the guidance.
>>> I did a small analysis of the accuracy and performance of the new
>>> Tesseract wrapper. I used this image [1] as the input and got the
>>> following data [2] after OCR. The first line is the recognised word,
>>> followed by the location details (bounding box) of the word. I think
>>> these details are enough for our task. What remains is converting a
>>> PDF file into an image, as you mentioned. I'm working on that now.
>>> 
>>> [1] https://www.dropbox.com/s/11wahtonoz08zmn/image4.TIF
>>> [2] https://gist.github.com/DImuthuUpe/9491660
>>> 
>>> Thanks
>>> Dimuthu
>>> 
>>> On Mon, Mar 10, 2014 at 2:30 PM, John Hewson <[email protected]> wrote:
>>>> Dimuthu,
>>>> 
>>>>> I finished basic implementation of JNI wrapper for Tesseract. Now it can
>>>>> be built using maven. Some useful methods that are needed to do basic OCR
>>>>> were implemented.
>>>> 
>>>> Great, it's looking good, nice and clean.
>>>> 
>>>>> 1. What is the task of processStream method in PDFTextStripper class line
>>>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>>>> page.findRotation() );
>>>> 
>>>> A PDF file is made up of pages, each of which contains a "content stream". 
>>>> This content stream contains a list of drawing commands such as "move to 
>>>> 10,15" or "write the word `foo`", these are called operators. The 
>>>> processStream function reads the stream for the current page and executes 
>>>> each of the operators. The operators themselves are implemented each in 
>>>> their own class which is a subclass of PDFOperator. The constructor of 
>>>> PDFStreamEngine creates the operator classes using reflection, which is 
>>>> rather odd and I'm not sure why this design was chosen. The operators used 
>>>> by PDFTextStripper can be found in 
>>>> org/apache/pdfbox/resources/PDFTextStripper.properties
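
To make the operator mechanism above concrete, here is a toy sketch of a stream engine that loads operator classes by reflection from a name-to-class mapping and then dispatches each operator in a token stream. This is not PDFBox code: ToyStreamEngine and its nested classes are hypothetical, and only the operator names Td (move text position) and Tj (show text) are real PDF operators.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Toy model only; PDFBox's real engine and operator classes differ.
final class ToyStreamEngine {
    /** Stand-in for an operator base class: one subclass per operator. */
    abstract static class Operator {
        abstract void process(List<String> operands, StringBuilder out);
    }

    static final class MoveText extends Operator {
        @Override
        void process(List<String> operands, StringBuilder out) {
            out.append("move to ").append(String.join(",", operands)).append('\n');
        }
    }

    static final class ShowText extends Operator {
        @Override
        void process(List<String> operands, StringBuilder out) {
            out.append("show ").append(String.join(" ", operands)).append('\n');
        }
    }

    private final Map<String, Operator> operators = new HashMap<>();

    /** Instantiates one handler per "operator name = class name" entry. */
    ToyStreamEngine(Properties mapping) {
        for (String name : mapping.stringPropertyNames()) {
            try {
                Class<?> type = Class.forName(mapping.getProperty(name));
                operators.put(name, (Operator) type.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("Cannot load operator " + name, e);
            }
        }
    }

    /** Operands accumulate until an operator token consumes them. */
    String processStream(List<String> tokens) {
        StringBuilder out = new StringBuilder();
        List<String> operands = new ArrayList<>();
        for (String token : tokens) {
            Operator operator = operators.get(token);
            if (operator == null) {
                operands.add(token);
            } else {
                operator.process(operands, out);
                operands = new ArrayList<>();
            }
        }
        return out.toString();
    }
}
```

With a mapping of Td to MoveText and Tj to ShowText, the tokens `10 15 Td (foo) Tj` produce one "move to" line and one "show" line, which mirrors the "move to 10,15" and "write the word foo" commands mentioned above.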
>>>> 
>>>>> 2. Say I need to extract images and their metadata from a pdf. What is the 
>>>>> better approach to do it?
>>>> 
>>>> You could subclass PDFTextStripper and override the startDocument method 
>>>> and use it to create a PDFRenderer and store it in a field. Then override 
>>>> the processPage method and use the previously created PDFRenderer to 
>>>> render the current page to a buffered image and perform OCR on the image. 
>>>> Once you have the OCR text + positions, instead of calling processStream 
>>>> you can call processTextPosition once for each character + position.
>>>> 
>>>> The PDFRenderer class was just added to the trunk, so make sure you do an 
>>>> "svn update". Let me know if you need me to change PDFTextStripper to make 
>>>> it easier to subclass.
>>>> 
>>>> Cheers
>>>> 
>>>> -- John
>>>> 
>>>> On 9 Mar 2014, at 09:08, DImuthu Upeksha <[email protected]> 
>>>> wrote:
>>>> 
>>>>> Hi John,
>>>>> I finished basic implementation of JNI wrapper for Tesseract. Now it can
>>>>> be built using maven. Some useful methods that are needed to do basic OCR
>>>>> were implemented.
>>>>> 
>>>>> I went through the PDFBox code several times and have a couple of
>>>>> questions that need to be clarified:
>>>>> 
>>>>> 1. What is the task of processStream method in PDFTextStripper class line
>>>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>>>> page.findRotation() );
>>>>> 
>>>>> 2. Say I need to extract images and their metadata from a pdf. What is the
>>>>> better approach to do it?
>>>>> 
>>>>> Thanks
>>>>> Dimuthu
>>>>> 
>>>>> 
>>>>> On Fri, Mar 7, 2014 at 9:26 PM, DImuthu Upeksha <[email protected]> wrote:
>>>>> 
>>>>>> Hi John
>>>>>> I refactored the Tesseract JNI code [1] to support a Maven build. To create
>>>>>> the JNI library I added pre-built static libraries of Tesseract and
>>>>>> Leptonica to the resources folder [2]. For now it includes only libraries
>>>>>> built for Mac, but we can easily add both Windows and Linux libraries.
>>>>>> After "mvn clean install", the jar is created under the target folder. Now
>>>>>> all the setup is done. What remains is implementing the native methods in
>>>>>> tessbaseapi.cpp [3]. I hope to finish it asap. Please let me know if there
>>>>>> is any concern about the project structure.
>>>>>> 
>>>>>> [1] https://github.com/DImuthuUpe/Tesseract-API.git
>>>>>> [2]
>>>>>> https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
>>>>>> [3]
>>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
>>>>>> 
>>>>>> Thanks
>>>>>> Dimuthu
>>>>>> 
>>>>>> 
>>>>>> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <[email protected]> wrote:
>>>>>> 
>>>>>>> Dimuthu
>>>>>>> 
>>>>>>>> There are a lot of code fragments in the current Android JNI wrapper
>>>>>>>> which use "(jint)somePointer" casting, which will create terrible
>>>>>>>> memory leaks in 64-bit environments because pointers are 64 bit. So I
>>>>>>>> believe writing it from the beginning is much better.
>>>>>>> 
>>>>>>> That's a classic 64-bit pitfall, well spotted. We definitely need to
>>>>>>> support 64-bit JVMs.
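
A small Java illustration of the pitfall, as a sketch only. JNI's jint is a 32-bit signed integer and jlong is 64-bit, so a native pointer handed to Java as an int loses its upper half on a 64-bit platform, while a long preserves it. The class name and the sample address below are made up for the demonstration.

```java
// Hypothetical demo class; it only models what the casts do to the value.
final class PointerWidthDemo {
    /** Models a "(jint) pointer" cast: only the low 32 bits are kept. */
    static int asJint(long pointer) {
        return (int) pointer;
    }

    /** True if the pointer is unchanged after a round trip through 32 bits. */
    static boolean survivesJintRoundTrip(long pointer) {
        return (long) asJint(pointer) == pointer;
    }

    public static void main(String[] args) {
        long lowAddress = 0x12345678L;           // fits in 32 bits
        long highAddress = 0x00007F3A12345678L;  // typical shape of a 64-bit address
        System.out.println(survivesJintRoundTrip(lowAddress));
        System.out.println(survivesJintRoundTrip(highAddress));
    }
}
```

The usual remedy is to declare the handle as long on the Java side and jlong on the native side, so the full address is carried in both directions.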
>>>>>>> 
>>>>>>>> we can use
>>>>>>>> the static library of Leptonica (I did and it worked nicely). I think
>>>>>>>> it is not an issue to use its static library because both Tesseract and
>>>>>>>> Leptonica are under apache licence.
>>>>>>> 
>>>>>>> Sounds good, I found the following in the README:
>>>>>>> 
>>>>>>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
>>>>>>> without Leptonica.
>>>>>>> 
>>>>>>> Which makes sense.
>>>>>>> 
>>>>>>> -- John
>>>>>>> 
>>>>>>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <[email protected]>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi John,
>>>>>>>> +1 for your suggestion about converting image <=> byte array on the Java
>>>>>>>> side. It reduces a lot of complexity. I don't know whether you have
>>>>>>>> noticed or not, but the jint data type in JNI is a 32-bit integer type.
>>>>>>>> I noticed it on my Mac but don't know about other operating systems.
>>>>>>>> 
>>>>>>>> Leptonica is the image processing library for Tesseract [1]. Tesseract
>>>>>>>> uses the image processing algorithms in Leptonica to implement its OCR
>>>>>>>> algorithms. This [2] is the .cpp file responsible for creating the
>>>>>>>> Tesseract API. You can see it includes the allheaders.h header file,
>>>>>>>> which is the main header file of Leptonica. So I think it is a must to
>>>>>>>> build Leptonica first and link it when we build Tesseract. This is not a
>>>>>>>> big problem if we can use the static library of Leptonica (I did and it
>>>>>>>> worked nicely). I think it is not an issue to use its static library
>>>>>>>> because both Tesseract and Leptonica are under apache licence.
>>>>>>>> 
>>>>>>>> I'm working on the maven implementation you have mentioned and will get
>>>>>>>> back to you soon.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Dimuthu
>>>>>>>> 
>>>>>>>> 
>>>>>>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>>>>>>>> [2] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> Hi Dimuthu,
>>>>>>>>> 
>>>>>>>>> 1,2,3:
>>>>>>>>> 
>>>>>>>>> Feel free to write your own Tesseract binding or port the existing code
>>>>>>>>> as you see fit. The JNI binding should be minimal, only the methods you
>>>>>>>>> require need to be wrapped. Also, don't forget that some of the interop
>>>>>>>>> can be done in Java, for example if it is easier to convert a
>>>>>>>>> BufferedImage to a byte array in Java then do it there and pass the
>>>>>>>>> result to JNI rather than writing lots of JNI C++ to achieve the same
>>>>>>>>> result.
>>>>>>>>> 
>>>>>>>>> Your GitHub repo looks like a good start, I can make comments there as
>>>>>>>>> things progress.
>>>>>>>>> 
>>>>>>>>> Is it possible to build Tesseract without leptonica? I was under the
>>>>>>>>> impression that it was used for image i/o only, but I may be misinformed.
>>>>>>>>> 
>>>>>>>>> 4: The native platform library should be built as part of the Maven build
>>>>>>>>> for the Tesseract wrapper, which can be a separate project. The output
>>>>>>>>> can be a jar file which contains the native binaries. It should be
>>>>>>>>> possible for the jar to contain prebuilt binaries for all platforms but
>>>>>>>>> this is something we can worry about later. Right now the goal should be
>>>>>>>>> to build a jar containing just the current platform's native binary and
>>>>>>>>> any Java wrapper code.
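
One way a jar could carry native binaries and load the right one at runtime is sketched below. Everything here is an assumption for illustration: the /native/<platform>-<arch>/ resource layout, the class name and the library base name are invented, and this is not how the Tesseract-API project is known to do it. The JDK calls used (Class.getResourceAsStream, Files.copy, System.load) are standard.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Locale;

// Hypothetical loader; the resource layout is an assumption.
final class NativeLibraryLoader {
    static String platformName(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        // Check Mac first: "darwin" also contains "win".
        if (os.contains("mac") || os.contains("darwin")) return "mac";
        if (os.contains("win")) return "windows";
        if (os.contains("linux")) return "linux";
        throw new UnsupportedOperationException("Unsupported OS: " + osName);
    }

    static String libraryFileName(String platform, String baseName) {
        switch (platform) {
            case "windows": return baseName + ".dll";
            case "mac":     return "lib" + baseName + ".dylib";
            default:        return "lib" + baseName + ".so";
        }
    }

    /** Builds the classpath location of the bundled binary. */
    static String resourcePath(String osName, String arch, String baseName) {
        String platform = platformName(osName);
        return "/native/" + platform + "-" + arch + "/"
                + libraryFileName(platform, baseName);
    }

    /** Extracts the bundled binary to a temp file and loads it. */
    static void load(String baseName) throws IOException {
        String resource = resourcePath(System.getProperty("os.name"),
                System.getProperty("os.arch"), baseName);
        try (InputStream in = NativeLibraryLoader.class.getResourceAsStream(resource)) {
            if (in == null) {
                throw new IOException("No bundled native library at " + resource);
            }
            Path extracted = Files.createTempFile(baseName, ".native");
            extracted.toFile().deleteOnExit();
            Files.copy(in, extracted, StandardCopyOption.REPLACE_EXISTING);
            System.load(extracted.toAbsolutePath().toString());
        }
    }
}
```

Extraction to a temporary file is needed because System.load takes a file system path and cannot read a library from inside a jar.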
>>>>>>>>> 
>>>>>>>>> -- John
>>>>>>>>> 
>>>>>>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi John,
>>>>>>>>>> 
>>>>>>>>>> I tried to reuse that Android JNI wrapper for Tesseract. Here are my
>>>>>>>>>> observations:
>>>>>>>>>> 
>>>>>>>>>> 1. This wrapper heavily depends on Android image libraries
>>>>>>>>>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>>>>>>>>>> 
>>>>>>>>>> 2. But I can understand the underlying logic in each function. Basically
>>>>>>>>>> it maps Tesseract API functions [2] to Java methods. In between it does
>>>>>>>>>> some image <=> byte array conversions using the bitmap libraries in
>>>>>>>>>> Android.
>>>>>>>>>> 
>>>>>>>>>> 3. There are two ways. 1: We can port its code to make it compatible
>>>>>>>>>> with our environments (Linux, Windows and Mac), which is really painful.
>>>>>>>>>> Also it will cause memory leaks. 2: We can use only its function
>>>>>>>>>> signatures and implement them using our own code.
>>>>>>>>>> 
>>>>>>>>>> I think the 2nd solution is better because we need only a few operations
>>>>>>>>>> to be done using the Tesseract library. I have created a GitHub repo [3]
>>>>>>>>>> for this. It's still not finished. I need to add some make files and
>>>>>>>>>> build files to make it run properly. I also need to implement those
>>>>>>>>>> wrapper functions [4]. This may take some time.
>>>>>>>>>> 
>>>>>>>>>> 4. Because we are calling native libraries we need different builds of
>>>>>>>>>> the Tesseract and Leptonica libraries for each platform (dll for
>>>>>>>>>> Windows, so for Linux, dylib for Mac). So we may need to build those
>>>>>>>>>> libraries at the time we build the PDFBox project. Or we can pre-build
>>>>>>>>>> those libraries and add them to the project in .dll, .so or .dylib
>>>>>>>>>> format. What is the preferred way?
>>>>>>>>>> 
>>>>>>>>>> [1] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>>>>>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>>>>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>>>>>>>>> [4] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> Dimuthu
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>> 
>>>>>>>>>>> I made the necessary changes to the document [1].
>>>>>>>>>>> 
>>>>>>>>>>> For the last two days I had a deep look at this [2] JNI wrapper for the
>>>>>>>>>>> Tesseract API. Unfortunately it has been designed for the Android
>>>>>>>>>>> environment, so I think we need to write our own make files to build it
>>>>>>>>>>> into a dll (Windows) or dylib (Mac). Currently it has Android.mk files
>>>>>>>>>>> [3]. I'm searching for a way to convert them to a make file that we can
>>>>>>>>>>> run on the console. Please suggest if you have a better approach.
>>>>>>>>>>> 
>>>>>>>>>>> [1] https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>>>>>>>>>>> [2] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>>>>>>>>>>> [3] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> This is a good start. However, there is no need for the Adder
>>>>>>>>>>>> component, "Extracted Text (OCR)" can just feed back into the PDFBox
>>>>>>>>>>>> "Text Extractor".
>>>>>>>>>>>> 
>>>>>>>>>>>> Maybe show a "PDF" file feeding in to "Text Extractor", to make it
>>>>>>>>>>>> clear where the process starts.
>>>>>>>>>>>> 
>>>>>>>>>>>> -- John
>>>>>>>>>>>> 
>>>>>>>>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>>>>>>>>>>> 
>>>>>>>>>>>>> [1] https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <[email protected]> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I should add that the OCR engine should be pluggable, so PDFToText
>>>>>>>>>>>>>> might use an interface, e.g. OCREngine, and there will be a
>>>>>>>>>>>>>> TesseractOCREngine class somewhere which provides the required
>>>>>>>>>>>>>> functionality and lives in a separate jar file.
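
A rough sketch of what such a pluggable interface might look like. Only the name OCREngine comes from the message above. The method signature, the Word type and the FixedResultEngine stub are assumptions made for illustration; a real TesseractOCREngine would call into the JNI wrapper instead of returning canned results.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; only the OCREngine name appears in the thread.
final class OcrPluginSketch {
    /** A recognised word plus its bounding box in image coordinates. */
    static final class Word {
        final String text;
        final int x, y, width, height;

        Word(String text, int x, int y, int width, int height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** The extension point: text extraction depends only on this interface. */
    interface OCREngine {
        List<Word> recognize(BufferedImage image);
    }

    /** Stub engine returning fixed results, standing in for a real backend. */
    static final class FixedResultEngine implements OCREngine {
        private final List<Word> words;

        FixedResultEngine(List<Word> words) {
            this.words = words;
        }

        @Override
        public List<Word> recognize(BufferedImage image) {
            return words;
        }
    }

    /** Joins the recognised words in the order the engine returned them. */
    static String extractText(OCREngine engine, BufferedImage image) {
        List<String> parts = new ArrayList<>();
        for (Word word : engine.recognize(image)) {
            parts.add(word.text);
        }
        return String.join(" ", parts);
    }
}
```

Because the caller sees only the interface, the Tesseract-backed implementation and its native binaries can live in a separate jar, as suggested above.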
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <[email protected]> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> So do you need to embed those new functionalities into the existing
>>>>>>>>>>>>>>> PDFToText algorithms or package them as a new subsystem (something
>>>>>>>>>>>>>>> like an API)?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>> From: "John Hewson" <[email protected]>
>>>>>>>>>>>>>>> Sent: 26/02/2014 07:38
>>>>>>>>>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>>>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>>>>>>>>>>>>> Introduction
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and
>>>>>>>>>>>>>>> page rotation.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs have
>>>>>>>>>>>>>>> corrupt encodings, which means the ASCII codes map to the wrong
>>>>>>>>>>>>>>> glyphs. We could OCR the glyphs to repair the encoding.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>>>>>> Let's say there is a PDF with both text in extractable format and some
>>>>>>>>>>>>>>>> images with text (scanned images). In that case we first extract the
>>>>>>>>>>>>>>>> extractable content using PDFBox algorithms and the rest is extracted
>>>>>>>>>>>>>>>> using OCR. Finally we pack both results together and give the output as
>>>>>>>>>>>>>>>> PDFToText. Am I correct? What do you mean by "location data"?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <
>>>>>>> [email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 1. What is called "glyphs" ?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>>>>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>>>>>>>>>>>> malformed PDFs from PDFBox and then we need to do processing using
>>>>>>>>>>>>>>>>>> OCR for more accurate results. But the problem is, why shouldn't we
>>>>>>>>>>>>>>>>>> directly do OCR on those PDFs without getting output from PDFBox?
>>>>>>>>>>>>>>>>>> Correct me if I'm wrong.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
>>>>>>>>>>>>>>>>> (PDFToText). The goal of this project is to enhance PDFToText so that
>>>>>>>>>>>>>>>>> it can use OCR to extract text from areas of the document where the
>>>>>>>>>>>>>>>>> text is embedded as an image. Such PDF files are typically generated
>>>>>>>>>>>>>>>>> by scanners or fax machines. There is also another case where OCR is
>>>>>>>>>>>>>>>>> useful: some fonts embedded in PDF files contain the wrong encoding,
>>>>>>>>>>>>>>>>> so when text is extracted with PDFToText the result is nonsense but
>>>>>>>>>>>>>>>>> when drawn with PDFToImage we see the correct letters.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Instead of:
>>>>>>>>>>>>>>>>> PDF => Image => OCR => Text
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> We want to do:
>>>>>>>>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Ok fixed. This is what I did:
>>>>>>>>>>>>>>>>>>> Right click on the new project -> Debug As -> Debug Configurations -> Source -> Add -> Project
>>>>>>>>>>>>>>>>>>> Then I selected the PDFBox project.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I'm using Eclipse. This is what I want. I created a new Java
>>>>>>>>>>>>>>>>>>>> application project (say TestPDFBox) with a main class with the
>>>>>>>>>>>>>>>>>>>> following code.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>>>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>>>>>>>>>>>> document.addPage( blankPage );
>>>>>>>>>>>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>>>>>>>>>>>> document.close();
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Then I needed to add the jar files generated in the target folder of
>>>>>>>>>>>>>>>>>>>> PDFBox to the build path of my new project (I built the PDFBox
>>>>>>>>>>>>>>>>>>>> project from source). That is what I did. But let's say I need to
>>>>>>>>>>>>>>>>>>>> check the functionality of the document.save("") method. I don't
>>>>>>>>>>>>>>>>>>>> have a reference to its sources because I used the generated jars
>>>>>>>>>>>>>>>>>>>> directly. As Tilman said, I built PDFBox from sources, but I don't
>>>>>>>>>>>>>>>>>>>> know a proper way to use it in other projects other than adding
>>>>>>>>>>>>>>>>>>>> those jar files to the build path.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <
>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the PDFToText
>>>>>>>>>>>>>>>>>>>>> class (in pdfbox-tools) using your IDE and pass a PDF file path as
>>>>>>>>>>>>>>>>>>>>> the command line argument.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>>>>>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and managed
>>>>>>>>>>>>>>>>>>>>>> to build it successfully. I looked at the classes you mentioned and
>>>>>>>>>>>>>>>>>>>>>> got a rough idea of how they work. To check them I used the jars in
>>>>>>>>>>>>>>>>>>>>>> the target folder in my separate Java project. I tried the samples in
>>>>>>>>>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to look further into the
>>>>>>>>>>>>>>>>>>>>>> code, especially how the processXXX() methods work in the
>>>>>>>>>>>>>>>>>>>>>> PDFTextStripper class. What I usually do is add some breakpoints and
>>>>>>>>>>>>>>>>>>>>>> check them in the debug windows. But using jars that's not possible.
>>>>>>>>>>>>>>>>>>>>>> What approach do you follow for such a task?
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I also installed Tesseract on my machine and managed to do some OCR
>>>>>>>>>>>>>>>>>>>>>> stuff. That's a cool tool which works fine.
>>>>>>>>>>>>>>>>>>>>>> I'm still learning the code. If I run into any issue I'll drop you a
>>>>>>>>>>>>>>>>>>>>>> mail.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/ and
>>>>>>>>>>>>>>>>>>>>>>> contains a basic overview of the project and details on how to
>>>>>>>>>>>>>>>>>>>>>>> obtain the source code and build PDFBox for yourself.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details the only
>>>>>>>>>>>>>>>>>>>>>>> thoughts so far regarding it. Note that the OCR libraries mentioned
>>>>>>>>>>>>>>>>>>>>>>> in the JIRA issue are all under the Apache license, which is a
>>>>>>>>>>>>>>>>>>>>>>> requirement.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Once you have the source code, take a look at the PageDrawer class
>>>>>>>>>>>>>>>>>>>>>>> to see how text and images are rendered. We want someone to
>>>>>>>>>>>>>>>>>>>>>>> interface at a low-level (e.g. one glyph, word, or sentence at a
>>>>>>>>>>>>>>>>>>>>>>> time) with an OCR engine. Also look at PDFTextStripper which is how
>>>>>>>>>>>>>>>>>>>>>>> text is currently extracted, take a look at how we have to go to
>>>>>>>>>>>>>>>>>>>>>>> great length to sort text back into reading order and infer the
>>>>>>>>>>>>>>>>>>>>>>> placement of diacritics - PDF is fundamentally a visual format, not
>>>>>>>>>>>>>>>>>>>>>>> a structured format like HTML - which is why extracting text can be
>>>>>>>>>>>>>>>>>>>>>>> so difficult sometimes.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>>>>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
>>>>>>>>>>>>>>>>>>>>>>> questions.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
>>>>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka. I successfully completed GSoC 2013
>>>>>>>>>>>>>>>>>>>>>>>> with the Apache ISIS [1] project. I'm very interested in OCR and
>>>>>>>>>>>>>>>>>>>>>>>> image processing, so I would like to select this project idea as my
>>>>>>>>>>>>>>>>>>>>>>>> GSoC 2014 project because I feel it is the best suited project for
>>>>>>>>>>>>>>>>>>>>>>>> me. At university we have also done some research in the OCR area and
>>>>>>>>>>>>>>>>>>>>>>>> our group wrote a literature review about increasing the efficiency
>>>>>>>>>>>>>>>>>>>>>>>> of OCR systems (attached). Can you please suggest where to start
>>>>>>>>>>>>>>>>>>>>>>>> learning about PDFBox?
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> [1] http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
