Re: [GSoC 2014]Optical Character Recognition project - Introduction

DImuthu Upeksha Tue, 25 Feb 2014 00:06:13 -0800

OK, fixed. This is what I did:
Right-click the new project -> Debug As -> Debug Configurations -> Source
-> Add -> Project
Then I selected the PDFBox project.
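
When deciding which sources to attach in the debugger, it can help to print where a class is actually loaded from at run time. The following is a minimal sketch and not part of PDFBox: the class name WhereLoaded and the helper locationOf are made up for illustration, and with PDFBox on the build path one would presumably pass PDDocument.class instead of the sketch's own class.

```java
import java.net.URL;
import java.security.CodeSource;

public class WhereLoaded {

    /**
     * Returns the jar or folder a class was loaded from,
     * or null when the class reports no code source.
     */
    static URL locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return source == null ? null : source.getLocation();
    }

    public static void main(String[] args) {
        // Assumption: in a project that uses PDFBox, replace WhereLoaded.class
        // with PDDocument.class to see which jar or folder provides it.
        System.out.println(locationOf(WhereLoaded.class));

        // Core JDK classes usually report no code source at all.
        System.out.println(locationOf(String.class));
    }
}
```

If the printed location is a jar, the sources attached under Debug Configurations should match the version that jar was built from.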


Thanks
Dimuthu


On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <[email protected]> wrote:

> I'm using Eclipse. This is what I want. I created a new Java application
> project (say TestPDFBox) with a main class containing the following code.
>
> PDDocument document = new PDDocument();
> PDPage blankPage = new PDPage();
> document.addPage(blankPage);
> document.save("BlankPage.pdf");
> document.close();
>
> Then I needed to add the jar files generated in the target folder of PDFBox
> to the build path of my new project (I built the PDFBox project from
> source). That is what I did. But let's say I need to check the functionality
> of the document.save("") method. I don't have a reference to its sources
> because I used the generated jars directly. As Tilman said, I built PDFBox
> from source, but I don't know a proper way to use it in other projects other
> than adding those jar files to the build path.
>
>
> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <[email protected]> wrote:
>
>> Which IDE are you using? You should be able to run the PDFToText class
>> (in pdfbox-tools) using your IDE and pass a PDF file path as the command
>> line argument.
>>
>> -- John
>>
>> > On 24 Feb 2014, at 22:38, DImuthu Upeksha <[email protected]> wrote:
>> >
>> > Hi John,
>> > Thanks for the reply. Yes, I checked out the PDFBox code and managed to
>> > build it successfully. I looked at the classes you mentioned and got a
>> > rough idea of how they work. To check them, I added the jars in the target
>> > folder to my separate Java project. I tried the samples at
>> > http://pdfbox.apache.org/cookbook/. I need to look further into the code,
>> > especially how those processXXX() methods work in the PDFTextStripper
>> > class. What I usually do is add some breakpoints and inspect them in the
>> > debug windows. But with jars that's not possible. What approach do you
>> > follow for such a task?
>> >
>> > I also installed Tesseract on my machine and managed to do some OCR
>> > with it. It's a cool tool that works well.
>> > I'm still learning the code. If I run into any issues I'll drop you a mail.
>> >
>> > Thanks
>> > Dimuthu
>> >
>> >
>> >> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <[email protected]> wrote:
>> >>
>> >> Hi Dimuthu
>> >>
>> >> The PDFBox website can be found at http://pdfbox.apache.org/. It contains
>> >> a basic overview of the project and details on how to obtain the source
>> >> code and build PDFBox for yourself.
>> >>
>> >> Currently we do not perform any OCR, and PDFBOX-1912 details the only
>> >> thoughts so far regarding it. Note that the OCR libraries mentioned in
>> >> the JIRA issue are all under the Apache license, which is a requirement.
>> >>
>> >> Once you have the source code, take a look at the PageDrawer class to see
>> >> how text and images are rendered. We want someone to interface at a low
>> >> level (e.g. one glyph, word, or sentence at a time) with an OCR engine.
>> >> Also look at PDFTextStripper, which is how text is currently extracted.
>> >> Take a look at how we have to go to great lengths to sort text back into
>> >> reading order and infer the placement of diacritics. PDF is fundamentally
>> >> a visual format, not a structured format like HTML, which is why
>> >> extracting text can be so difficult sometimes.
>> >>
>> >> The full PDF Reference document can be found at:
>> >>
>> >> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>> >>
>> >> Feel free to discuss specifics of your proposal or ask any questions.
>> >>
>> >> Thanks,
>> >>
>> >> -- John
>> >>
>> >> On 23 Feb 2014, at 21:13, DImuthu Upeksha <[email protected]> wrote:
>> >>
>> >>> Hi,
>> >>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
>> >>> University of Moratuwa, Sri Lanka. I successfully completed GSoC 2013
>> >>> with the Apache ISIS [1] project. I'm very interested in OCR and image
>> >>> processing, so I would like to select this project idea as my GSoC 2014
>> >>> project because I feel it is the best-suited project for me. At
>> >>> university we have also done some research in the OCR area, and our
>> >>> group wrote a literature review about increasing the efficiency of OCR
>> >>> systems (attached). Can you please suggest where I should start learning
>> >>> about PDFBox?
>> >>>
>> >>> [1]
>> >>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>> >>>
>> >>> Thank you
>> >>> Dimuthu
>> >>>
>> >>> --
>> >>> Regards
>> >>> W.Dimuthu Upeksha
>> >>> Undergraduate
>> >>> Department of Computer Science And Engineering
>> >>> University of Moratuwa, Sri Lanka
>> >
>> >
>> > --
>> > Regards
>> >
>> > W.Dimuthu Upeksha
>> > Undergraduate
>> > Department of Computer Science And Engineering
>> >
>> > University of Moratuwa, Sri Lanka
>>
>
>
>
> --
> Regards
>
> W.Dimuthu Upeksha
> Undergraduate
> Department of Computer Science And Engineering
>
> University of Moratuwa, Sri Lanka
>



-- 
Regards

W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka
