Re: [GSoC 2014]Optical Character Recognition project - Introduction

DImuthu Upeksha Tue, 25 Feb 2014 01:08:27 -0800

Hi John,
I have a couple of questions.
1. What are "glyphs"?
2. What is the main requirement of this project?
As far as I understand, first we need to generate an image of malformed
PDFs using PDFBox, and then process that image with OCR to get more
accurate results. But why shouldn't we run OCR directly on those PDFs,
without going through PDFBox first? Correct me if I'm wrong.
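
To make the OCR step of the pipeline described above concrete, here is a
minimal sketch of how a rendered page image could be handed to the Tesseract
command-line tool from Java. It only builds the command line
"tesseract <image> <outputbase> -l <lang>". The class name TesseractCommand
and the file names are hypothetical, and rendering the page to page-1.png
with PDFBox is assumed to have happened already and is not shown.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Builds (but does not run) a Tesseract command line for one page image. */
class TesseractCommand {

    /**
     * Tesseract's CLI has the form: tesseract <image> <outputbase> [-l <lang>].
     * The recognised text is written to <outputbase>.txt.
     */
    static List<String> build(Path image, Path outputBase, String language) {
        List<String> command = new ArrayList<>();
        command.add("tesseract");
        command.add(image.toString());
        command.add(outputBase.toString());
        command.add("-l");
        command.add(language);
        return command;
    }

    public static void main(String[] args) {
        // Hypothetical file names; the PNG would come from rendering a PDF page.
        List<String> command =
                build(Paths.get("page-1.png"), Paths.get("page-1"), "eng");
        System.out.println(String.join(" ", command));
        // To actually run it (assumes tesseract is installed and on the PATH):
        // new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```

Keeping command construction separate from process execution means the
command can be inspected without Tesseract being installed.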


Thanks
Dimuthu


On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <[email protected]> wrote:

> Ok fixed. This is what I did
> Right-click the new project -> Debug As -> Debug Configurations -> Source
> -> Add -> Project.
> Then I selected the PDFBox project.
>
> Thanks
> Dimuthu
>
>
> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
> [email protected]> wrote:
>
>> I'm using Eclipse. This is what I want. I created a new Java application
>> project (say TestPDFBox) with a main class containing the following code.
>>
>> PDDocument document = new PDDocument();
>> PDPage blankPage = new PDPage();
>> document.addPage(blankPage);
>> document.save("BlankPage.pdf");
>> document.close();
>>
>> Then I need to add the jar files generated in the target folder of PDFBox
>> to the build path of my new project (I built the PDFBox project from
>> source). That is what I did. But let's say I need to check the
>> functionality of the document.save("") method. I don't have a reference to
>> its sources because I used the generated jars directly. As Tilman said, I
>> built PDFBox from sources, but I don't know a proper way to use it in other
>> projects other than adding those jar files to the build path.
>>
>>
>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <[email protected]> wrote:
>>
>>> Which IDE are you using? You should be able to run the PDFToText class
>>> (in pdfbox-tools) using your IDE and pass a PDF file path as the command
>>> line argument.
>>>
>>> -- John
>>>
>>> > On 24 Feb 2014, at 22:38, DImuthu Upeksha <[email protected]> wrote:
>>> >
>>> > Hi John,
>>> > Thanks for the reply. Yes, I checked out the PDFBox code and managed to
>>> > build it successfully. I looked at the classes you mentioned and got a
>>> > rough idea of how they work. To check them, I added the jars in the
>>> > target folder to a separate Java project and tried the samples at
>>> > http://pdfbox.apache.org/cookbook/. I need to look further into the code,
>>> > especially how the processXXX() methods in the PDFTextStripper class work.
>>> > What I usually do is add some breakpoints and inspect them in the debug
>>> > windows, but that is not possible when using the jars. How do you go
>>> > about such a task?
>>> >
>>> > I also installed Tesseract on my machine and managed to do some OCR
>>> > with it. It's a cool tool that works fine.
>>> > I'm still learning the code. If I run into any issues I'll drop you a mail.
>>> >
>>> > Thanks
>>> > Dimuthu
>>> >
>>> >
>>> >> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <[email protected]> wrote:
>>> >>
>>> >> Hi Dimuthu
>>> >>
>>> >> The PDFBox website can be found at http://pdfbox.apache.org/. It contains
>>> >> a basic overview of the project and details on how to obtain the source
>>> >> code and build PDFBox for yourself.
>>> >>
>>> >> Currently we do not perform any OCR, and PDFBOX-1912 details the only
>>> >> thoughts so far regarding it. Note that the OCR libraries mentioned in
>>> >> the JIRA issue are all under the Apache license, which is a requirement.
>>> >>
>>> >> Once you have the source code, take a look at the PageDrawer class to see
>>> >> how text and images are rendered. We want someone to interface at a low
>>> >> level (e.g. one glyph, word, or sentence at a time) with an OCR engine.
>>> >> Also look at PDFTextStripper, which is how text is currently extracted.
>>> >> Note how we have to go to great lengths to sort text back into reading
>>> >> order and infer the placement of diacritics. PDF is fundamentally a
>>> >> visual format, not a structured format like HTML, which is why
>>> >> extracting text can be so difficult sometimes.
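
As a rough illustration of the reading-order problem mentioned above, the
following sketch groups positioned glyphs into lines and then sorts each line
left to right. It is a deliberately simplified example and not how
PDFTextStripper actually works. The Glyph type, the ReadingOrder class, and
the lineTolerance parameter are hypothetical, and the sketch assumes a
coordinate system where y grows downward (PDF's native origin is bottom-left).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ReadingOrder {

    /** A glyph with a position; hypothetical type, not a PDFBox class. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Sorts glyphs top to bottom, groups glyphs whose y is within
     * lineTolerance of the first glyph of the current line, then sorts
     * each line left to right. Each line ends with a newline.
     */
    static String toText(List<Glyph> glyphs, double lineTolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        List<List<Glyph>> lines = new ArrayList<>();
        for (Glyph g : sorted) {
            List<Glyph> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(g.y - last.get(0).y) <= lineTolerance) {
                last.add(g);
            } else {
                List<Glyph> line = new ArrayList<>();
                line.add(g);
                lines.add(line);
            }
        }

        StringBuilder out = new StringBuilder();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            for (Glyph g : line) {
                out.append(g.text);
            }
            out.append('\n');
        }
        return out.toString();
    }
}
```

Even this toy version needs a tolerance to decide which glyphs share a line,
which hints at why real documents with columns, rotated text, or diacritics
require much more work.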
>>> >>
>>> >> The full PDF Reference document can be found at:
>>> >>
>>> >>
>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>> >>
>>> >> Feel free to discuss specifics of your proposal or ask any questions.
>>> >>
>>> >> Thanks,
>>> >>
>>> >> -- John
>>> >>
>>> >> On 23 Feb 2014, at 21:13, DImuthu Upeksha <[email protected]> wrote:
>>> >>
>>> >>> Hi,
>>> >>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
>>> >>> University of Moratuwa, Sri Lanka. I successfully completed GSoC 2013
>>> >>> with the Apache ISIS project [1]. I'm very interested in OCR and image
>>> >>> processing, so I would like to select this project idea as my GSoC 2014
>>> >>> project because I feel it is the best suited project for me. At
>>> >>> university we have also done some research in the OCR area, and our
>>> >>> group wrote a literature review about increasing the efficiency of OCR
>>> >>> systems (attached). Can you please suggest where to start learning
>>> >>> about PDFBox?
>>> >>>
>>> >>> [1]
>>> >>
>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>> >>>
>>> >>> Thank you
>>> >>> Dimuthu
>>> >>>
>>> >>> --
>>> >>> Regards
>>> >>> W.Dimuthu Upeksha
>>> >>> Undergraduate
>>> >>> Department of Computer Science And Engineering
>>> >>> University of Moratuwa, Sri Lanka
>>> >
>>> >
>>> > --
>>> > Regards
>>> >
>>> > W.Dimuthu Upeksha
>>> > Undergraduate
>>> > Department of Computer Science And Engineering
>>> >
>>> > University of Moratuwa, Sri Lanka
>>>
>>
>>
>>
>> --
>> Regards
>>
>> W.Dimuthu Upeksha
>> Undergraduate
>> Department of Computer Science And Engineering
>>
>> University of Moratuwa, Sri Lanka
>>
>
>
>
> --
> Regards
>
> W.Dimuthu Upeksha
> Undergraduate
> Department of Computer Science And Engineering
>
> University of Moratuwa, Sri Lanka
>



-- 
Regards

W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka
