Re: [GSoC 2014]Optical Character Recognition project - Introduction

John Hewson Wed, 19 Mar 2014 10:40:12 -0700

Hi Dimuthu

This is a good start. One point to address is that a String in Java is
encoded as UTF-16, so your getUTF8Text() method must be doing something
wrong. It should perform a UTF-16 conversion internally and be renamed to
getText(). You can probably do the conversion in Java rather than in C++
(or maybe Tesseract can return UTF-16?).
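
For the Java-side conversion, a minimal sketch might look like the
following. It assumes the native method hands back the raw UTF-8 bytes as a
byte[] rather than a String; NativeTextDecoder and decodeUtf8 are
hypothetical names, not part of the wrapper.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: turns UTF-8 bytes from native code into a Java String. */
final class NativeTextDecoder {
    private NativeTextDecoder() {}

    /**
     * Decodes UTF-8 bytes into a String. The String constructor performs the
     * UTF-8 to UTF-16 conversion, so callers never see the native encoding.
     */
    static String decodeUtf8(byte[] utf8Bytes) {
        if (utf8Bytes == null) {
            return ""; // assumption: treat "no text" as an empty result
        }
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }
}
```

A getText() method on the wrapper could then simply return the result of
decodeUtf8 for whatever bytes the native side produced.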


Cheers

-- John

On 16 Mar 2014, at 06:15, DImuthu Upeksha <[email protected]> wrote:

> Hi John,
> 
> For now I'm using those methods to debug the wrapper. I'll remove
> those methods after I finish testing it.
> 
> I started implementing OCR-plugin [1] for PDFBox. Currently it
> satisfies basic requirements such as getting word+location data [2].
> Please have a look at that and let me know if any changes are
> required.
> 
> [1] https://github.com/DImuthuUpe/OCR-Plugin
> [2] 
> https://github.com/DImuthuUpe/OCR-Plugin/blob/master/src/main/java/org/apache/pdfbox/ocr/OCRConnector.java
> 
> Thanks
> Dimuthu
> 
> On Fri, Mar 14, 2014 at 12:09 AM, John Hewson <[email protected]> wrote:
>> Thanks, I saw your new refactoring too, it's good. Now the following methods 
>> are no longer needed:
>> 
>> public void setImagePath(String path)
>> public void setImage(byte[] imagedata, int width, int height, int bpp, int bpl)
>> 
>> Cheers
>> 
>> -- John
>> 
>> On 11 Mar 2014, at 22:58, DImuthu Upeksha <[email protected]> wrote:
>> 
>>> Hi John,
>>> Yes. I implemented a new method [1] that accepts the image as a byte
>>> stream. We can't send BufferedImage objects directly to the native side,
>>> so I convert the BufferedImage into a byte array and pass that to the
>>> native side, where it is converted again into a compatible format. With
>>> that request we need to pass some metadata about the byte stream: image
>>> width, height, bytes per pixel and bytes per row. I checked it with
>>> this test case [2] and it works fine.
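
The Java half of that conversion can be sketched roughly as follows. This is
an illustration only: RawImage and its fields are hypothetical names, and
packing the pixels as 3 bytes per pixel in R, G, B order with no row padding
is an assumption about what the native side would accept.

```java
import java.awt.image.BufferedImage;

/** Hypothetical holder for raw pixels plus the metadata the native side needs. */
final class RawImage {
    final byte[] data;
    final int width;
    final int height;
    final int bytesPerPixel;
    final int bytesPerLine;

    private RawImage(byte[] data, int width, int height, int bytesPerPixel, int bytesPerLine) {
        this.data = data;
        this.width = width;
        this.height = height;
        this.bytesPerPixel = bytesPerPixel;
        this.bytesPerLine = bytesPerLine;
    }

    /** Packs any BufferedImage into tightly packed 8-bit RGB rows. */
    static RawImage fromBufferedImage(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        int bytesPerPixel = 3;                    // R, G, B
        int bytesPerLine = width * bytesPerPixel; // no row padding in this sketch
        byte[] data = new byte[bytesPerLine * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int argb = image.getRGB(x, y);    // packed as 0xAARRGGBB
                int i = y * bytesPerLine + x * bytesPerPixel;
                data[i] = (byte) (argb >> 16);    // red
                data[i + 1] = (byte) (argb >> 8); // green
                data[i + 2] = (byte) argb;        // blue
            }
        }
        return new RawImage(data, width, height, bytesPerPixel, bytesPerLine);
    }
}
```

The five fields correspond to the byte array, width, height, bytes per pixel
and bytes per row described above.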
>>> 
>>> [1] 
>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/java/com/apache/pdfbox/ocr/tesseract/TessBaseAPI.java#L74
>>> [2] 
>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/test/java/com/apache/pdfbox/ocr/tesseract/TessByteSteamTest.java
>>> 
>>> Thanks
>>> Dimuthu
>>> 
>>> On Wed, Mar 12, 2014 at 12:40 AM, John Hewson <[email protected]> wrote:
>>>> Hi Dimuthu
>>>> 
>>>> The Tesseract wrapper needs to take its input from a BufferedImage rather 
>>>> than reading a file from disk, so instead of:
>>>> 
>>>> api.setImagePath("test.tif");
>>>> 
>>>> What we need is:
>>>> 
>>>> BufferedImage image = ImageIO.read(new File("test.tif"));
>>>> api.setImagePath(image);
>>>> 
>>>> Because this will let us use the BufferedImage generated by PDFRenderer
>>>> without round-tripping to the disk.
>>>> 
>>>> -- John
>>>> 
>>>> On 11 Mar 2014, at 11:13, DImuthu Upeksha <[email protected]> 
>>>> wrote:
>>>> 
>>>>> Hi John,
>>>>> Thanks for the guidance.
>>>>> I did a small analysis of the accuracy and performance of the new
>>>>> Tesseract wrapper. I used this image [1] as the input and got the
>>>>> following data [2] after OCR. The first line is the recognised word,
>>>>> followed by the location details (bounding box) of the word. I think
>>>>> these details are enough for our task. What remains now is converting
>>>>> the PDF file into an image, as you mentioned. I'm working on that
>>>>> these days.
>>>>> 
>>>>> [1] https://www.dropbox.com/s/11wahtonoz08zmn/image4.TIF
>>>>> [2] https://gist.github.com/DImuthuUpe/9491660
>>>>> 
>>>>> Thanks
>>>>> Dimuthu
>>>>> 
>>>>> On Mon, Mar 10, 2014 at 2:30 PM, John Hewson <[email protected]> wrote:
>>>>>> Dimuthu,
>>>>>> 
>>>>>>> I finished a basic implementation of the JNI wrapper for Tesseract. It
>>>>>>> can now be built using Maven. Some useful methods that are needed to do
>>>>>>> basic OCR are implemented.
>>>>>> 
>>>>>> Great, it's looking good, nice and clean.
>>>>>> 
>>>>>>> 1. What is the task of processStream method in PDFTextStripper class 
>>>>>>> line
>>>>>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>>>>>> page.findRotation() );
>>>>>> 
>>>>>> A PDF file is made up of pages, each of which contains a "content 
>>>>>> stream". This content stream contains a list of drawing commands such as 
>>>>>> "move to 10,15" or "write the word `foo`", these are called operators. 
>>>>>> The processStream function reads the stream for the current page and 
>>>>>> executes each of the operators. The operators themselves are implemented 
>>>>>> each in their own class which is a subclass of PDFOperator. The 
>>>>>> constructor of PDFStreamEngine creates the operator classes using 
>>>>>> reflection, which is rather odd and I'm not sure why this design was 
>>>>>> chosen. The operators used by PDFTextStripper can be found in 
>>>>>> org/apache/pdfbox/resources/PDFTextStripper.properties
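
To make the operator idea concrete, here is a toy sketch of that kind of
dispatch loop. It is not PDFBox's code: ToyStreamEngine, its Operator
interface and the pre-split token list are all assumptions made for
illustration. Only the operator names Td (move text position) and Tj (show
text) come from the PDF specification.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy content-stream interpreter: operands pile up until an operator consumes them. */
final class ToyStreamEngine {
    /** One handler per operator name (hypothetical interface). */
    interface Operator {
        void process(List<String> operands, StringBuilder extractedText);
    }

    private final Map<String, Operator> operators = new HashMap<>();

    void register(String name, Operator operator) {
        operators.put(name, operator);
    }

    /** Walks the tokens in order and runs each registered operator it meets. */
    String run(List<String> tokens) {
        StringBuilder extractedText = new StringBuilder();
        List<String> operands = new ArrayList<>();
        for (String token : tokens) {
            Operator operator = operators.get(token);
            if (operator == null) {
                operands.add(token); // not a known operator, so treat it as an operand
            } else {
                operator.process(new ArrayList<>(operands), extractedText);
                operands.clear();    // operands belong to exactly one operator
            }
        }
        return extractedText.toString();
    }
}
```

Registering a no-op handler for "Td" and a handler for "Tj" that appends its
first operand, then running the tokens 10, 15, Td, foo, Tj, yields the text
"foo".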
>>>>>> 
>>>>>>> 2. Say I need to extract images and their metadata from a PDF. What is
>>>>>>> the better approach to do it?
>>>>>> 
>>>>>> You could subclass PDFTextStripper and override the startDocument method 
>>>>>> and use it to create a PDFRenderer and store it in a field. Then 
>>>>>> override the processPage method and use the previously created 
>>>>>> PDFRenderer to render the current page to a buffered image and perform 
>>>>>> OCR on the image. Once you have the OCR text + positions, instead of 
>>>>>> calling processStream you can call processTextPosition once for each 
>>>>>> character + position.
>>>>>> 
>>>>>> The PDFRenderer class was just added to the trunk, so make sure you do 
>>>>>> an "svn update". Let me know if you need me to change PDFTextStripper to 
>>>>>> make it easier to subclass.
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> -- John
>>>>>> 
>>>>>> On 9 Mar 2014, at 09:08, DImuthu Upeksha <[email protected]> 
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi John,
>>>>>>> I finished a basic implementation of the JNI wrapper for Tesseract. It
>>>>>>> can now be built using Maven. Some useful methods that are needed to do
>>>>>>> basic OCR are implemented.
>>>>>>> 
>>>>>>> I went through the PDFBox code several times and have a couple of
>>>>>>> questions that need to be clarified.
>>>>>>> 
>>>>>>> 1. What is the task of processStream method in PDFTextStripper class 
>>>>>>> line
>>>>>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>>>>>> page.findRotation() );
>>>>>>> 
>>>>>>> 2. Say I need to extract images and their metadata from a PDF. What is
>>>>>>> the better approach to do it?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Dimuthu
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Mar 7, 2014 at 9:26 PM, DImuthu Upeksha
>>>>>>> <[email protected]>wrote:
>>>>>>> 
>>>>>>>> Hi John
>>>>>>>> I refactored the Tesseract JNI code [1] to support a Maven build. To
>>>>>>>> create the JNI library I added pre-built static libraries of Tesseract
>>>>>>>> and Leptonica to the resources folder [2]. For now it includes only the
>>>>>>>> libraries for Mac, but we can easily add both Windows and Linux
>>>>>>>> libraries. After "mvn clean install", the jar is created under the
>>>>>>>> target folder. Now all the setting up is done. What remains is
>>>>>>>> implementing the native methods in tessbaseapi.cpp [3]. I hope to
>>>>>>>> finish it ASAP. Please let me know if there is any concern about the
>>>>>>>> project structure.
>>>>>>>> 
>>>>>>>> [1] https://github.com/DImuthuUpe/Tesseract-API.git
>>>>>>>> [2]
>>>>>>>> https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
>>>>>>>> [3]
>>>>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Dimuthu
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> Dimuthu
>>>>>>>>> 
>>>>>>>>>> There are a lot of code fragments in the current Android JNI wrapper
>>>>>>>>>> which use "(jint)somePointer" casting, which will create terrible
>>>>>>>>>> memory leaks in 64-bit environments because pointers are 64-bit. So I
>>>>>>>>>> believe writing it from the beginning is much better.
>>>>>>>>> 
>>>>>>>>> That's a classic 64-bit pitfall, well spotted. We definitely need to
>>>>>>>>> support 64-bit JVMs.
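
The truncation is easy to reproduce on the Java side, since a jint maps to a
Java int and a jlong to a Java long. A small sketch, using the hypothetical
names NativeHandle, toJint and survivesJintRoundTrip:

```java
/** Hypothetical demo of why native pointers must not be stored in a 32-bit int. */
final class NativeHandle {
    private NativeHandle() {}

    /** Mimics what a "(jint)somePointer" cast keeps: only the low 32 bits. */
    static int toJint(long pointer) {
        return (int) pointer;
    }

    /** True only if the full pointer can be rebuilt from the 32-bit value. */
    static boolean survivesJintRoundTrip(long pointer) {
        long rebuilt = toJint(pointer) & 0xFFFFFFFFL; // zero-extend back to 64 bits
        return rebuilt == pointer;
    }
}
```

Declaring the native handle as a long in Java (jlong on the C++ side) leaves
room for a full pointer on both 32-bit and 64-bit JVMs.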
>>>>>>>>> 
>>>>>>>>>> we can use
>>>>>>>>>> the static library of Leptonica (I did and it worked nicely). I think
>>>>>>>>>> it is not an issue to use its static library because both Tesseract
>>>>>>>>>> and Leptonica are under the Apache licence.
>>>>>>>>> 
>>>>>>>>> Sounds good, I found the following in the README:
>>>>>>>>> 
>>>>>>>>> Leptonica is required. (www.leptonica.com). Tesseract no longer 
>>>>>>>>> compiles
>>>>>>>>> without Leptonica.
>>>>>>>>> 
>>>>>>>>> Which makes sense.
>>>>>>>>> 
>>>>>>>>> -- John
>>>>>>>>> 
>>>>>>>>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi John,
>>>>>>>>>> +1 for your suggestion about converting image <=> byte array on the
>>>>>>>>>> Java side. It removes a lot of complexity. I don't know whether you
>>>>>>>>>> have noticed or not, but the jint data type in JNI is a 32-bit integer
>>>>>>>>>> type. I noticed it on my Mac but don't know about other operating
>>>>>>>>>> systems.
>>>>>>>>>> 
>>>>>>>>>> Leptonica is the image processing library for Tesseract [1]. Tesseract
>>>>>>>>>> uses the image processing algorithms in Leptonica to implement its OCR
>>>>>>>>>> algorithms. This .cpp file [2] is responsible for creating the
>>>>>>>>>> Tesseract API. You can see that it includes the allheaders.h header
>>>>>>>>>> file, which is the main header file of Leptonica. So I think it is a
>>>>>>>>>> must to build Leptonica first and link it when we build Tesseract.
>>>>>>>>>> This is not a big problem if we can use the static library of
>>>>>>>>>> Leptonica (I did and it worked nicely). I think it is not an issue to
>>>>>>>>>> use its static library because both Tesseract and Leptonica are under
>>>>>>>>>> the Apache licence.
>>>>>>>>>> 
>>>>>>>>>> I'm working on the Maven implementation you mentioned and will get
>>>>>>>>>> back to you soon.
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> Dimuthu
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>>>>>>>>>> [2]
>>>>>>>>>> 
>>>>>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <[email protected]> 
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Dimuthu,
>>>>>>>>>>> 
>>>>>>>>>>> 1,2,3:
>>>>>>>>>>> 
>>>>>>>>>>> Feel free to write your own Tesseract binding or port the existing
>>>>>>>>>>> code as you see fit. The JNI binding should be minimal, only the
>>>>>>>>>>> methods you require need to be wrapped. Also, don't forget that some
>>>>>>>>>>> of the interop can be done in Java, for example if it is easier to
>>>>>>>>>>> convert a BufferedImage to a byte array in Java then do it there and
>>>>>>>>>>> pass the result to JNI rather than writing lots of JNI C++ to achieve
>>>>>>>>>>> the same result.
>>>>>>>>>>> 
>>>>>>>>>>> Your GitHub repo looks like a good start, I can make comments there 
>>>>>>>>>>> as
>>>>>>>>>>> things progress.
>>>>>>>>>>> 
>>>>>>>>>>> Is it possible to build Tesseract without Leptonica? I was under the
>>>>>>>>>>> impression that it was used for image I/O only, but I may be
>>>>>>>>>>> misinformed.
>>>>>>>>>>> 
>>>>>>>>>>> 4: The native platform library should be built as part of the Maven
>>>>>>>>>>> build for the Tesseract wrapper, which can be a separate project. The
>>>>>>>>>>> output can be a jar file which contains the native binaries. It should
>>>>>>>>>>> be possible for the jar to contain prebuilt binaries for all platforms
>>>>>>>>>>> but this is something we can worry about later. Right now the goal
>>>>>>>>>>> should be to build a jar containing just the current platform's native
>>>>>>>>>>> binary and any Java wrapper code.
>>>>>>>>>>> 
>>>>>>>>>>> -- John
>>>>>>>>>>> 
>>>>>>>>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha 
>>>>>>>>>>> <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi John,
>>>>>>>>>>>> 
>>>>>>>>>>>> I tried to reuse that Android JNI wrapper for Tesseract. Here are my
>>>>>>>>>>>> observations:
>>>>>>>>>>>> 
>>>>>>>>>>>> 1. This wrapper heavily depends on android image libraries.
>>>>>>>>>>>> (android/bitmap.h). Most of the wrapper methods [1] use this 
>>>>>>>>>>>> library.
>>>>>>>>>>>> 
>>>>>>>>>>>> 2. But I can understand the underlying logic in each function.
>>>>>>>>>>>> Basically what it does is map Tesseract API functions [2] to Java
>>>>>>>>>>>> methods. In between it does some image <=> byte array conversions
>>>>>>>>>>>> using the Android bitmap libraries.
>>>>>>>>>>>> 
>>>>>>>>>>>> 3. There are two ways. 1: We can port its code to make it compatible
>>>>>>>>>>>> with our environments (Linux, Windows and Mac), which is really
>>>>>>>>>>>> painful. It will also cause memory leaks. 2: We can use only its
>>>>>>>>>>>> function signatures and implement them using our own code.
>>>>>>>>>>>> 
>>>>>>>>>>>> I think the 2nd solution is better because we need only a few
>>>>>>>>>>>> operations to be done using the Tesseract library. I have created a
>>>>>>>>>>>> GitHub repo [3] for this. It's still not finished. I need to add some
>>>>>>>>>>>> make files and build files to make it run properly. I also need to
>>>>>>>>>>>> implement those wrapper functions [4]. This may take some time.
>>>>>>>>>>>> 
>>>>>>>>>>>> 4. Because we are calling native libraries we need different builds
>>>>>>>>>>>> of the Tesseract and Leptonica libraries for each platform (dll for
>>>>>>>>>>>> Windows, so for Linux, dylib for Mac). So we may need to build those
>>>>>>>>>>>> libraries at the time we build the PDFBox project. Or we can pre-build
>>>>>>>>>>>> those libraries and add them to the project in .dll, .so or .dylib
>>>>>>>>>>>> format. What is the preferred way?
>>>>>>>>>>>> 
>>>>>>>>>>>> [1]
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>>>>>>>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>>>>>>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>>>>>>>>>>> [4]
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
>>>>>>>>>>> [email protected]
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> I made the necessary changes to the document [1].
>>>>>>>>>>>>> 
>>>>>>>>>>>>> For the last two days I had a deep look at this JNI wrapper [2] for
>>>>>>>>>>>>> the Tesseract API. Unfortunately it has been designed for the Android
>>>>>>>>>>>>> environment, so I think we need to write our own make files to build
>>>>>>>>>>>>> it into a dll (Windows) or dylib (Mac). Currently it has Android.mk
>>>>>>>>>>>>> files [3]. I'm searching for a way to convert them to a make file
>>>>>>>>>>>>> that we can run on the console. Please suggest a better approach if
>>>>>>>>>>>>> you have one.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>>>>>>>>>>>>> [2]
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>>>>>>>>>>>>> [3]
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is a good start. However, there is no need for the Adder
>>>>>>>>>>>>>> component, "Extracted Text (OCR)" can just feed back into the PDFBox
>>>>>>>>>>>>>> "Text Extractor".
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Maybe show a "PDF" file feeding in to "Text Extractor", to make it
>>>>>>>>>>>>>> clear where the process starts.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <
>>>>>>>>> [email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I should add that the OCR engine should be pluggable, so PDFToText
>>>>>>>>>>>>>>>> might use an interface, e.g. OCREngine, and there will be a
>>>>>>>>>>>>>>>> TesseractOCREngine class somewhere which provides the required
>>>>>>>>>>>>>>>> functionality and lives in a separate jar file.
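
One possible shape for such an interface is sketched below. OCREngine and
TesseractOCREngine are the names suggested above; the recognize method, the
OCRWord class and its fields are assumptions for illustration only.

```java
import java.awt.image.BufferedImage;
import java.util.List;

/** A recognised word and its bounding box in image pixels (hypothetical type). */
final class OCRWord {
    final String text;
    final int x;
    final int y;
    final int width;
    final int height;

    OCRWord(String text, int x, int y, int width, int height) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }
}

/**
 * Pluggable OCR back end. A TesseractOCREngine living in a separate jar
 * would be one implementation; the caller depends only on this interface.
 */
interface OCREngine {
    /** Returns the words found in the image, each with its location. */
    List<OCRWord> recognize(BufferedImage image);
}
```

With this split, the text extraction code never refers to Tesseract directly,
so another engine could be swapped in without touching it.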
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> So do you need to embed those new functionalities into 
>>>>>>>>>>>>>>>>> existing
>>>>>>>>>>>>>>>> PDFtoText algorithms or package them as a new sub 
>>>>>>>>>>>>>>>> system(something
>>>>>>>>>>>>>> like an
>>>>>>>>>>>>>>>> API)?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>> From: "John Hewson" <[email protected]>
>>>>>>>>>>>>>>>>> Sent: 26/02/2014 07:38
>>>>>>>>>>>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>>>>>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project 
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>> Introduction
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and
>>>>>>>>>>>>>>>>> page rotation.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs have
>>>>>>>>>>>>>>>>> corrupt encodings, which means the ASCII codes map to the wrong
>>>>>>>>>>>>>>>>> glyphs. We could OCR the glyphs to repair the encoding.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>>>>>>>> Let's say there is a PDF with both text in an extractable format and
>>>>>>>>>>>>>>>>>> some images with text (scanned images). In that case we first extract
>>>>>>>>>>>>>>>>>> the extractable content using PDFBox algorithms and the rest is
>>>>>>>>>>>>>>>>>> extracted using OCR. Finally we pack both results together and give
>>>>>>>>>>>>>>>>>> the output as PDFToText. Am I correct? What do you mean by "location
>>>>>>>>>>>>>>>>>> data"?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <
>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 1. What is called "glyphs" ?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>>>>>>>>>>>>> As far as I understood, first we need to generate an image 
>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>> malformed pdfs from
>>>>>>>>>>>>>>>>>>>> PDFBox and then we need to do processing using OCR for 
>>>>>>>>>>>>>>>>>>>> further
>>>>>>>>>>>>>>>> accurate
>>>>>>>>>>>>>>>>>>>> results.  But the problem is, why shouldn't we directly do
>>>>>>>>> OCR on
>>>>>>>>>>>>>>>> those
>>>>>>>>>>>>>>>>>>>> PDFs without getting output from PDFBox? Correct me if I'm
>>>>>>>>> wrong.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
>>>>>>>>>>>>>>>>>>> (PDFToText). The goal of this project is to enhance PDFToText so
>>>>>>>>>>>>>>>>>>> that it can use OCR to extract text from areas of the document
>>>>>>>>>>>>>>>>>>> where the text is embedded as an image. Such PDF files are
>>>>>>>>>>>>>>>>>>> typically generated by scanners or fax machines. There is also
>>>>>>>>>>>>>>>>>>> another case where OCR is useful: some fonts embedded in PDF files
>>>>>>>>>>>>>>>>>>> contain the wrong encoding, so when text is extracted with
>>>>>>>>>>>>>>>>>>> PDFToText the result is nonsense but when drawn with PDFToImage we
>>>>>>>>>>>>>>>>>>> see the correct letters.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Instead of:
>>>>>>>>>>>>>>>>>>> PDF => Image => OCR => Text
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> We want to do:
>>>>>>>>>>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Ok fixed. This is what I did
>>>>>>>>>>>>>>>>>>>>> Right click on the new project ->Debug As-> Debug
>>>>>>>>> Configurations
>>>>>>>>>>>>>>>>>>> ->Source
>>>>>>>>>>>>>>>>>>>>> ->Add -> Project
>>>>>>>>>>>>>>>>>>>>> Then I selected PDFBox project.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I'm using Eclipse. This is what I want. I created a new Java
>>>>>>>>>>>>>>>>>>>>>> application project (say TestPDFBox) with a main class containing
>>>>>>>>>>>>>>>>>>>>>> the following code.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>>>>>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>>>>>>>>>>>>>> document.addPage( blankPage );
>>>>>>>>>>>>>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>>>>>>>>>>>>>> document.close();
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Then I need to add the jar files generated in the target folder of
>>>>>>>>>>>>>>>>>>>>>> PDFBox to the build path of my new project (I built the PDFBox
>>>>>>>>>>>>>>>>>>>>>> project from source). That is what I did. But let's say I need to
>>>>>>>>>>>>>>>>>>>>>> check the functionality of the document.save("") method. I don't
>>>>>>>>>>>>>>>>>>>>>> have a reference to its sources because I used the generated jars
>>>>>>>>>>>>>>>>>>>>>> directly. As Tilman said, I built PDFBox from sources, but I don't
>>>>>>>>>>>>>>>>>>>>>> know a proper way to use it in other projects other than adding
>>>>>>>>>>>>>>>>>>>>>> those jar files to the build path.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <
>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the PDFToText
>>>>>>>>>>>>>>>>>>>>>>> class (in pdfbox-tools) using your IDE and pass a PDF file path as
>>>>>>>>>>>>>>>>>>>>>>> the command line argument.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and
>>>>>>>>>>>>>>>>>>>>>>>> managed to build it successfully. I looked at the classes you
>>>>>>>>>>>>>>>>>>>>>>>> mentioned and I got a rough idea about how they work. To check
>>>>>>>>>>>>>>>>>>>>>>>> them I added the jars in the target folder to my separate Java
>>>>>>>>>>>>>>>>>>>>>>>> project. I tried the samples in
>>>>>>>>>>>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to look further into
>>>>>>>>>>>>>>>>>>>>>>>> the code, especially how those processXXX() methods work in the
>>>>>>>>>>>>>>>>>>>>>>>> PDFTextStripper class. What I usually do is add some breakpoints
>>>>>>>>>>>>>>>>>>>>>>>> and check them in the debug windows. But using jars that's not
>>>>>>>>>>>>>>>>>>>>>>>> possible. What approach do you follow for such a task?
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> I also installed Tesseract on my machine and managed to do some
>>>>>>>>>>>>>>>>>>>>>>>> OCR. It's a cool tool which works fine.
>>>>>>>>>>>>>>>>>>>>>>>> I'm still learning the code. If I run into any issue I'll drop
>>>>>>>>>>>>>>>>>>>>>>>> you a mail.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/. It
>>>>>>>>>>>>>>>>>>>>>>>>> contains a basic overview of the project and details on how to
>>>>>>>>>>>>>>>>>>>>>>>>> obtain the source code and build PDFBox for yourself.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details the
>>>>>>>>>>>>>>>>>>>>>>>>> only thoughts so far regarding it. Note that the OCR libraries
>>>>>>>>>>>>>>>>>>>>>>>>> mentioned in the JIRA issue are all under the Apache license,
>>>>>>>>>>>>>>>>>>>>>>>>> which is a requirement.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Once you have the source code, take a look at the PageDrawer
>>>>>>>>>>>>>>>>>>>>>>>>> class to see how text and images are rendered. We want someone to
>>>>>>>>>>>>>>>>>>>>>>>>> interface at a low level (e.g. one glyph, word, or sentence at a
>>>>>>>>>>>>>>>>>>>>>>>>> time) with an OCR engine. Also look at PDFTextStripper, which is
>>>>>>>>>>>>>>>>>>>>>>>>> how text is currently extracted, and take a look at how we have
>>>>>>>>>>>>>>>>>>>>>>>>> to go to great lengths to sort text back into reading order and
>>>>>>>>>>>>>>>>>>>>>>>>> infer the placement of diacritics - PDF is fundamentally a visual
>>>>>>>>>>>>>>>>>>>>>>>>> format, not a structured format like HTML - which is why
>>>>>>>>>>>>>>>>>>>>>>>>> extracting text can be so difficult sometimes.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
>>>>>>>>>>>>>>>>>>>>>>>>> questions.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
>>>>>>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka. I successfully completed GSoC
>>>>>>>>>>>>>>>>>>>>>>>>>> 2013 with the Apache ISIS project [1]. I'm very much interested
>>>>>>>>>>>>>>>>>>>>>>>>>> in OCR and image processing, so I would like to select this
>>>>>>>>>>>>>>>>>>>>>>>>>> project idea as my GSoC 2014 project because I feel it is the
>>>>>>>>>>>>>>>>>>>>>>>>>> best-suited project for me. At university we have also done some
>>>>>>>>>>>>>>>>>>>>>>>>>> research in the OCR area, and our group wrote a literature review
>>>>>>>>>>>>>>>>>>>>>>>>>> about increasing the efficiency of OCR systems (attached). Can
>>>>>>>>>>>>>>>>>>>>>>>>>> you please suggest where to start learning about PDFBox?
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
