Re: [GSoC 2014]Optical Character Recognition project - Introduction

DImuthu Upeksha Sun, 09 Mar 2014 09:09:38 -0700

Hi John,
I have finished a basic implementation of the JNI wrapper for Tesseract. It
can now be built using Maven, and the methods needed to do basic OCR are
implemented.


I went through the PDFBox code several times and have a couple of questions
that need to be clarified:

1. What is the task of the processStream method in the PDFTextStripper class,
line 456: processStream( page.findResources(), content, page.findCropBox(),
page.findRotation() );

2. Say I need to extract images and their metadata from a PDF. What is the
best approach to do it?

Thanks
Dimuthu


On Fri, Mar 7, 2014 at 9:26 PM, DImuthu Upeksha
<[email protected]> wrote:

> Hi John,
> I refactored the Tesseract JNI code to support a Maven build. To create the
> JNI library I added pre-built static libraries of Tesseract and Leptonica to
> the resources folder [2]. For now it includes only the libraries for Mac, but
> we can easily add both Windows and Linux libraries. After "mvn clean
> install", the jar is created under the target folder. All the setup is now
> done. What remains is implementing the native methods in tessbaseapi.cpp
> [3]. I hope to finish it soon. Please let me know if you have any concerns
> about the project structure.
>
> [1] https://github.com/DImuthuUpe/Tesseract-API.git
> [2]
> https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
> [3]
> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
>
> Thanks
> Dimuthu
>
>
> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <[email protected]> wrote:
>
>> Dimuthu
>>
>> > There are a lot of code
>> > fragments in the current Android JNI wrapper which use "(jint)somePointer"
>> > casting, which will create terrible memory leaks in 64-bit environments
>> > because pointers are 64 bits wide. So I believe writing it from scratch is
>> > much better.
>>
>> That's a classic 64-bit pitfall, well spotted. We definitely need to support
>> 64-bit JVMs.
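
To illustrate the pitfall discussed above: JNI defines jint as a 32-bit signed integer on every platform and jlong as a 64-bit one, so a native pointer squeezed through a jint loses its upper half on a 64-bit system. The sketch below only simulates that cast in plain Java; the class name, the helper names and the address are made up for illustration and are not taken from either wrapper.

```java
public class NativeHandleDemo {
    // Mimics the C cast (jint) somePointer: only the low 32 bits survive.
    static int asJint(long pointer) {
        return (int) pointer;
    }

    // Mimics native code widening the received jint back to a pointer-sized value.
    static long fromJint(int handle) {
        return handle; // sign-extends to 64 bits
    }

    public static void main(String[] args) {
        long pointer = 0x00007F3A12345678L; // hypothetical 64-bit address
        long roundTripped = fromJint(asJint(pointer));

        System.out.println("original:      0x" + Long.toHexString(pointer));
        System.out.println("after (jint):  0x" + Long.toHexString(roundTripped));
        System.out.println("same address?  " + (pointer == roundTripped));
    }
}
```

Declaring the handle as long on the Java side (jlong in the native signature) avoids the truncation, which is presumably what a rewritten wrapper would do.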
>>
>> > we can use
>> > the static library of Leptonica (I did and it worked nicely). I think it is
>> > not an issue to use its static library because both Tesseract and Leptonica
>> > are under the Apache licence.
>>
>> Sounds good, I found the following in the README:
>>
>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
>> without Leptonica.
>>
>> Which makes sense.
>>
>> -- John
>>
>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <[email protected]>
>> wrote:
>>
>> > Hi John,
>> > +1 for your suggestion about converting image <=> byte array on the Java
>> > side. It removes a lot of complexity. I don't know whether you have noticed,
>> > but the jint data type in JNI is a 32-bit integer type. I noticed it on my
>> > Mac but don't know about other operating systems.
>> >
>> > Leptonica is the image processing library for Tesseract [1]. Tesseract
>> > uses the image processing algorithms in Leptonica to implement its OCR
>> > algorithms. This [2] is the .cpp file responsible for creating the Tesseract
>> > API. You can see that it includes allheaders.h, which is the main header
>> > file of Leptonica. So I think we must build Leptonica first and
>> > link it when we build Tesseract. This is not a big problem if we can use
>> > the static library of Leptonica (I did and it worked nicely). I think it is
>> > not an issue to use its static library because both Tesseract and Leptonica
>> > are under the Apache licence.
>> >
>> > I'm working on the Maven implementation you mentioned and will get
>> > back to you soon.
>> >
>> > Thanks
>> > Dimuthu
>> >
>> >
>> > [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>> > [2]
>> >
>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>> >
>> >
>> > On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <[email protected]> wrote:
>> >
>> >> Hi Dimuthu,
>> >>
>> >> 1,2,3:
>> >>
>> >> Feel free to write your own Tesseract binding or port the existing code as
>> >> you see fit. The JNI binding should be minimal: only the methods you require
>> >> need to be wrapped. Also, don't forget that some of the interop can be done
>> >> in Java. For example, if it is easier to convert a BufferedImage to a byte
>> >> array in Java, then do it there and pass the result to JNI rather than
>> >> writing lots of JNI C++ to achieve the same result.
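
As a rough sketch of the "do the conversion in Java" idea above, the following turns a BufferedImage into a flat array of 8-bit grayscale pixels using only the standard library. The class and method names are hypothetical, and the luminance weights are a common convention rather than anything the wrapper requires.

```java
import java.awt.image.BufferedImage;

public class ImageBytes {
    // Converts any BufferedImage to 8-bit grayscale, one byte per pixel, row by row.
    static byte[] toGrayBytes(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        byte[] pixels = new byte[width * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the usual luminance weights.
                int gray = (r * 299 + g * 587 + b * 114) / 1000;
                pixels[y * width + x] = (byte) gray;
            }
        }
        return pixels;
    }

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(4, 2, BufferedImage.TYPE_INT_RGB);
        image.setRGB(0, 0, 0xFFFFFF); // one white pixel, the rest stay black
        byte[] pixels = toGrayBytes(image);
        System.out.println(pixels.length + " bytes, first pixel = " + (pixels[0] & 0xFF));
    }
}
```

An array in this layout, together with the width, height, one byte per pixel and a row stride equal to the width, is the kind of input a thin native method could pass on to Tesseract's SetImage call, so the C++ side would need no image handling of its own.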
>> >>
>> >> Your GitHub repo looks like a good start. I can make comments there as
>> >> things progress.
>> >>
>> >> Is it possible to build Tesseract without Leptonica? I was under the
>> >> impression that it was used for image I/O only, but I may be misinformed.
>> >>
>> >> 4: The native platform library should be built as part of the Maven build
>> >> for the Tesseract wrapper, which can be a separate project. The output can
>> >> be a jar file which contains the native binaries. It should be possible for
>> >> the jar to contain prebuilt binaries for all platforms, but this is something
>> >> we can worry about later. Right now the goal should be to build a jar
>> >> containing just the current platform's native binary and any Java wrapper
>> >> code.
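
One possible way to consume a jar that carries its own native binary, as described above, is to copy the library out of the jar to a temporary file and load it from there. This is only a sketch under assumptions: the /native/<os>/ resource layout, the class name and the library name "tessjni" are all hypothetical, and the actual project may organise things differently.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class NativeLoader {
    // Maps the os.name system property to a folder name inside the jar (hypothetical layout).
    static String osFolder(String osName) {
        String os = osName.toLowerCase();
        if (os.contains("mac") || os.contains("darwin")) return "mac";
        if (os.contains("win")) return "windows";
        return "linux";
    }

    // Builds the resource path, e.g. /native/linux/libtessjni.so on Linux.
    static String resourcePath(String libName) {
        return "/native/" + osFolder(System.getProperty("os.name"))
                + "/" + System.mapLibraryName(libName);
    }

    // Extracts the bundled library to a temp file and loads it into the JVM.
    static void load(String libName) throws IOException {
        String resource = resourcePath(libName);
        try (InputStream in = NativeLoader.class.getResourceAsStream(resource)) {
            if (in == null) {
                throw new IOException("No native library bundled at " + resource);
            }
            Path tmp = Files.createTempFile("native-", "-" + System.mapLibraryName(libName));
            tmp.toFile().deleteOnExit();
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            System.load(tmp.toAbsolutePath().toString());
        }
    }

    public static void main(String[] args) {
        // Only prints the path that would be looked up; no library is loaded here.
        System.out.println(resourcePath("tessjni"));
    }
}
```

With this shape, adding another platform later would only mean adding another folder of prebuilt binaries to the jar.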
>> >>
>> >> -- John
>> >>
>> >> On 3 Mar 2014, at 16:41, DImuthu Upeksha <[email protected]>
>> >> wrote:
>> >>
>> >>> Hi John,
>> >>>
>> >>> I tried to reuse the Android JNI wrapper for Tesseract. Here are my
>> >>> observations:
>> >>>
>> >>> 1. This wrapper depends heavily on the Android image libraries
>> >>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>> >>>
>> >>> 2. But I can understand the underlying logic in each function. Basically,
>> >>> it maps Tesseract API functions [2] to Java methods. In between, it does
>> >>> some image <=> byte array conversions using the Android bitmap
>> >>> libraries.
>> >>>
>> >>> 3. There are two ways. 1: We can port its code to make it compatible with
>> >>> our environments (Linux, Windows and Mac), which is really painful. Also it
>> >>> will cause memory leaks. 2: We can use only its function signatures and
>> >>> implement them using our own code.
>> >>>
>> >>> I think the second solution is better because we need only a few operations
>> >>> from the Tesseract library. I have created a GitHub repo [3] for this.
>> >>> It's still not finished. I need to add some make files and build files to
>> >>> make it run properly. I also need to implement the wrapper functions
>> >>> [4]. This may take some time.
>> >>>
>> >>> 4. Because we are calling native libraries, we need different builds of
>> >>> the Tesseract and Leptonica libraries for each platform (dll for Windows,
>> >>> so for Linux, dylib for Mac). So we may need to build those libraries at
>> >>> the time we build the PDFBox project. Or we can pre-build those libraries
>> >>> and add them to the project in .dll, .so or .dylib format. What is the
>> >>> preferred way?
>> >>>
>> >>> [1]
>> >>>
>> >>
>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>> >>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>> >>> [3] https://github.com/DImuthuUpe/Tesseract-API
>> >>> [4]
>> >>>
>> >>
>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>> >>>
>> >>> Thanks
>> >>> Dimuthu
>> >>>
>> >>>
>> >>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
>> >> [email protected]
>> >>>> wrote:
>> >>>
>> >>>> I made the necessary changes to the document [1].
>> >>>>
>> >>>> For the last two days I had a deep look at this [2] JNI wrapper for the
>> >>>> Tesseract API. Unfortunately it has been designed for the Android
>> >>>> environment, so I think we need to write our own make files to build it
>> >>>> into a dll (Windows) or dylib (Mac). Currently it has Android.mk files [3].
>> >>>> I'm searching for a way to convert them to a make file that we can run from
>> >>>> the console. Please suggest a better approach if you have one.
>> >>>>
>> >>>> [1]
>> >>>>
>> >>
>> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>> >>>> [2]
>> >>>>
>> >>
>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>> >>>> [3]
>> >>>>
>> >>
>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>> >>>>
>> >>>>
>> >>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <[email protected]>
>> wrote:
>> >>>>
>> >>>>> This is a good start. However, there is no need for the Adder component;
>> >>>>> "Extracted Text (OCR)" can just feed back into the PDFBox "Text
>> >>>>> Extractor".
>> >>>>>
>> >>>>> Maybe show a "PDF" file feeding into "Text Extractor", to make it clear
>> >>>>> where the process starts.
>> >>>>>
>> >>>>> -- John
>> >>>>>
>> >>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <
>> [email protected]>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>> >>>>>>
>> >>>>>> [1]
>> >>>>>>
>> >>>>>
>> >>
>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>> >>>>>>
>> >>>>>> Thanks
>> >>>>>> Dimuthu
>> >>>>>>
>> >>>>>>
>> >>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <[email protected]>
>> >> wrote:
>> >>>>>>
>> >>>>>>> I should add that the OCR engine should be pluggable, so PDFToText might
>> >>>>>>> use an interface, e.g. OCREngine, and there will be a TesseractOCREngine
>> >>>>>>> class somewhere which provides the required functionality and lives in a
>> >>>>>>> separate jar file.
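
The pluggable design suggested above could look roughly like the sketch below. Only the names OCREngine and TesseractOCREngine come from the thread; the recognize method, its signature, the stand-in engine and the demo class are assumptions made for illustration, not an agreed API.

```java
import java.awt.image.BufferedImage;

public class OcrPluginDemo {
    // Text extraction code depends only on the interface, never on a concrete engine.
    static String textFor(BufferedImage region, OCREngine engine) {
        return engine.recognize(region).trim();
    }

    public static void main(String[] args) {
        // A stand-in engine; a TesseractOCREngine from a separate jar would slot in here.
        OCREngine engine = new FixedTextOCREngine("  Hello PDFBox ");
        BufferedImage region = new BufferedImage(1, 1, BufferedImage.TYPE_BYTE_GRAY);
        System.out.println(textFor(region, engine));
    }
}

// Hypothetical contract: turn an image region into the text it shows.
interface OCREngine {
    String recognize(BufferedImage image);
}

// Trivial implementation used only to exercise the interface without native code.
final class FixedTextOCREngine implements OCREngine {
    private final String text;

    FixedTextOCREngine(String text) {
        this.text = text;
    }

    @Override
    public String recognize(BufferedImage image) {
        return text;
    }
}
```

Keeping the interface this small means the core library never needs the native jar on its classpath unless OCR is actually requested.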
>> >>>>>>>
>> >>>>>>> -- John
>> >>>>>>>
>> >>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <[email protected]>
>> >> wrote:
>> >>>>>>>>
>> >>>>>>>> So do you need to embed the new functionality into the existing
>> >>>>>>>> PDFToText algorithms or package it as a new subsystem (something like an
>> >>>>>>>> API)?
>> >>>>>>>>
>> >>>>>>>> -----Original Message-----
>> >>>>>>>> From: "John Hewson" <[email protected]>
>> >>>>>>>> Sent: 26/02/2014 07:38
>> >>>>>>>> To: "[email protected]" <[email protected]>
>> >>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>> >>>>>>> Introduction
>> >>>>>>>>
>> >>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and page
>> >>>>>>>> rotation.
>> >>>>>>>>
>> >>>>>>>> There is another use case for OCR: some fonts embedded in PDFs have
>> >>>>>>>> corrupt encodings, which means the ASCII codes map to the wrong glyphs. We
>> >>>>>>>> could OCR the glyphs to repair the encoding.
>> >>>>>>>>
>> >>>>>>>> -- John
>> >>>>>>>>
>> >>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>> >>>>> [email protected]>
>> >>>>>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>> Hi John,
>> >>>>>>>>> Thanks for the explanation.
>> >>>>>>>>> Let's say there is a PDF with both text in extractable format and some
>> >>>>>>>>> images with text (scanned images). In that case we first extract the
>> >>>>>>>>> extractable content using the PDFBox algorithms, and the rest is extracted
>> >>>>>>>>> using OCR. Finally we pack both results together and give the output as
>> >>>>>>>>> PDFToText. Am I correct? What do you mean by "location data"?
>> >>>>>>>>>
>> >>>>>>>>> Thanks
>> >>>>>>>>> Dimuthu
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <
>> [email protected]>
>> >>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>> 1. What is called "glyphs"?
>> >>>>>>>>>>
>> >>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>> >>>>>>>>>>
>> >>>>>>>>>>> 2. What is the main requirement of this project?
>> >>>>>>>>>>> As far as I understand, first we need to generate an image of the
>> >>>>>>>>>>> malformed PDFs from PDFBox and then process it using OCR for more
>> >>>>>>>>>>> accurate results. But the question is, why shouldn't we do OCR directly
>> >>>>>>>>>>> on those PDFs without getting output from PDFBox? Correct me if I'm
>> >>>>>>>>>>> wrong.
>> >>>>>>>>>>
>> >>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text (PDFToText).
>> >>>>>>>>>> The goal of this project is to enhance PDFToText so that it can use OCR to
>> >>>>>>>>>> extract text from areas of the document where the text is embedded as an
>> >>>>>>>>>> image. Such PDF files are typically generated by scanners or fax machines.
>> >>>>>>>>>> There is also another case where OCR is useful: some fonts embedded in PDF
>> >>>>>>>>>> files contain the wrong encoding, so when text is extracted with PDFToText
>> >>>>>>>>>> the result is nonsense, but when drawn with PDFToImage we see the correct
>> >>>>>>>>>> letters.
>> >>>>>>>>>>
>> >>>>>>>>>> Instead of:
>> >>>>>>>>>> PDF => Image => OCR => Text
>> >>>>>>>>>>
>> >>>>>>>>>> We want to do:
>> >>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
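
The second pipeline above (many small images plus location data, each fed to OCR) can be sketched as follows. Everything here is hypothetical: the Region type, its fields, and the simple top-to-bottom, left-to-right ordering are illustrative assumptions, and real reading-order sorting in PDFBox is considerably more involved.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

public class RegionOcrDemo {
    // A cropped piece of a page plus where it came from (hypothetical type).
    static final class Region {
        final String imageId; // stand-in for the cropped image data
        final float x;        // horizontal position on the page
        final float y;        // vertical position, measured from the top in this sketch
        final int rotation;   // page rotation in degrees; carried along, unused here

        Region(String imageId, float x, float y, int rotation) {
            this.imageId = imageId;
            this.x = x;
            this.y = y;
            this.rotation = rotation;
        }
    }

    // Runs OCR on each region and joins the results top-to-bottom, left-to-right.
    static String extract(List<Region> regions, Function<Region, String> ocr) {
        List<Region> ordered = new ArrayList<>(regions);
        ordered.sort(Comparator.comparingDouble((Region r) -> r.y)
                .thenComparingDouble(r -> r.x));
        StringBuilder text = new StringBuilder();
        for (Region region : ordered) {
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(ocr.apply(region));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        List<Region> regions = new ArrayList<>();
        regions.add(new Region("world", 60f, 10f, 0));
        regions.add(new Region("hello", 10f, 10f, 0));
        regions.add(new Region("again", 10f, 30f, 0));
        // Fake OCR: pretend each image simply contains its id as text.
        System.out.println(extract(regions, r -> r.imageId));
    }
}
```

The point of keeping the location data next to each image is that the recognised words can be merged back into the same ordering logic that ordinary extracted text goes through.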
>> >>>>>>>>>>
>> >>>>>>>>>> -- John
>> >>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>> >>>>>>>>>> [email protected]
>> >>>>>>>>>>>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> OK, fixed. This is what I did:
>> >>>>>>>>>>>> Right click on the new project -> Debug As -> Debug Configurations ->
>> >>>>>>>>>>>> Source -> Add -> Project
>> >>>>>>>>>>>> Then I selected the PDFBox project.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Thanks
>> >>>>>>>>>>>> Dimuthu
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>> >>>>>>>>>>>> [email protected]> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> I'm using Eclipse. This is what I want. I created a new Java application
>> >>>>>>>>>>>>> project (say TestPDFBox) with a main class containing the following code.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> PDDocument document = new PDDocument();
>> >>>>>>>>>>>>> PDPage blankPage = new PDPage();
>> >>>>>>>>>>>>> document.addPage( blankPage );
>> >>>>>>>>>>>>> document.save("BlankPage.pdf");
>> >>>>>>>>>>>>> document.close();
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Then I need to add the jar files generated in the target folder of PDFBox
>> >>>>>>>>>>>>> to the build path of my new project (I built the PDFBox project from
>> >>>>>>>>>>>>> source). That is what I did. But let's say I need to check the
>> >>>>>>>>>>>>> functionality of the document.save("") method. I don't have a reference
>> >>>>>>>>>>>>> to its sources because I used the generated jars directly. As Tilman
>> >>>>>>>>>>>>> said, I built PDFBox from sources, but I don't know a proper way to use it
>> >>>>>>>>>>>>> in other projects other than adding those jar files to the build path.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <
>> >> [email protected]>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Which IDE are you using? You should be able to run the PDFToText class
>> >>>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as the command
>> >>>>>>>>>>>>>> line argument.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> -- John
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>> >>>>>>>>>> [email protected]>
>> >>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Hi John,
>> >>>>>>>>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and managed to
>> >>>>>>>>>>>>>>> build it successfully. I looked at the classes you mentioned and got a
>> >>>>>>>>>>>>>>> rough idea of how they work. To check them I added the jars in the target
>> >>>>>>>>>>>>>>> folder to a separate Java project. I tried the samples in
>> >>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to look further into the code,
>> >>>>>>>>>>>>>>> especially how the processXXX() methods work in the PDFTextStripper class.
>> >>>>>>>>>>>>>>> What I usually do is add some breakpoints and inspect them in the debug
>> >>>>>>>>>>>>>>> windows. But using jars that's not possible. How do you go about such a
>> >>>>>>>>>>>>>>> task?
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> I also installed Tesseract on my machine and managed to do some OCR
>> >>>>>>>>>>>>>>> with it. It's a cool tool which works fine.
>> >>>>>>>>>>>>>>> I'm still learning the code. If I run into any issue I'll drop you a mail.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Thanks
>> >>>>>>>>>>>>>>> Dimuthu
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>> >>>>> [email protected]
>> >>>>>>>>
>> >>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Hi Dimuthu
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/ and it
>> >>>>>>>>>>>>>>>> contains a basic overview of the project and details on how to obtain the
>> >>>>>>>>>>>>>>>> source code and build PDFBox for yourself.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Currently we do not perform any OCR, and PDFBOX-1912 details the only
>> >>>>>>>>>>>>>>>> thoughts so far regarding it. Note that the OCR libraries mentioned in
>> >>>>>>>>>>>>>>>> the JIRA issue are all under the Apache license, which is a requirement.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Once you have the source code, take a look at the PageDrawer class to see
>> >>>>>>>>>>>>>>>> how text and images are rendered. We want someone to interface at a low
>> >>>>>>>>>>>>>>>> level (e.g. one glyph, word, or sentence at a time) with an OCR engine.
>> >>>>>>>>>>>>>>>> Also look at PDFTextStripper, which is how text is currently extracted,
>> >>>>>>>>>>>>>>>> and take a look at how we have to go to great lengths to sort text back
>> >>>>>>>>>>>>>>>> into reading order and infer the placement of diacritics. PDF is
>> >>>>>>>>>>>>>>>> fundamentally a visual format, not a structured format like HTML, which
>> >>>>>>>>>>>>>>>> is why extracting text can be so difficult sometimes.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>> >>>>>>>>>>
>> >>>>>>>
>> >>>>>
>> >>
>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any questions.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> -- John
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>> >>>>>>>>>> [email protected]
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Hi,
>> >>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
>> >>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka. I successfully completed GSoC 2013
>> >>>>>>>>>>>>>>>>> with the Apache ISIS [1] project. I'm very much interested in OCR and
>> >>>>>>>>>>>>>>>>> image processing, so I would like to select this project idea as my GSoC
>> >>>>>>>>>>>>>>>>> 2014 project because I feel it is the best suited project for me. At
>> >>>>>>>>>>>>>>>>> university we have also done some research in the OCR area, and our group
>> >>>>>>>>>>>>>>>>> wrote a literature review about increasing the efficiency of OCR systems
>> >>>>>>>>>>>>>>>>> (attached). Can you please suggest where to start learning about PDFBox?
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> [1]
>> >>>>>>>>>>
>> >>>>>>>
>> >>>>>
>> >>
>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Thank you
>> >>>>>>>>>>>>>>>>> Dimuthu
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> --
>> >>>>>>>>>>>>>>>>> Regards
>> >>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>> >>>>>>>>>>>>>>>>> Undergraduate
>> >>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>> >>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka



-- 
Regards

W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka
