Hi John,
Thanks for the reply. Yes, I checked out the PDFBox code and managed to build
it successfully. I looked at the classes you mentioned and got a rough
idea of how they work. To try them out, I added the jars from the target
folder to a separate Java project and ran the samples from
http://pdfbox.apache.org/cookbook/. I need to look further into the code,
especially how the processXXX() methods in the PDFTextStripper class work.
What I usually do is add some breakpoints and inspect the state in the debug
windows, but that's not possible when using the jars. What approach do you
follow for this kind of task?

I also installed Tesseract on my machine and managed to do some OCR with
it. It's a cool tool that works well.
I'm still learning the code. If I run into any issues, I'll drop you a mail.

Thanks
Dimuthu


On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <[email protected]> wrote:

> Hi Dimuthu
>
> The PDFBox website can be found at http://pdfbox.apache.org/. It contains
> a basic overview of the project and details on how to obtain the source
> code and build PDFBox for yourself.
>
> Currently we do not perform any OCR and PDFBOX-1912 details the only
> thoughts so far regarding it.
> Note that the OCR libraries mentioned in the JIRA issue are all under the
> Apache license, which is a
> requirement.
>
> Once you have the source code, take a look at the PageDrawer class to see
> how text and images are rendered. We want someone to interface at a low
> level (e.g. one glyph, word, or sentence at a time) with an OCR engine.
> Also look at PDFTextStripper, which is how text is currently extracted.
> Note how we have to go to great lengths to sort text back into reading
> order and infer the placement of diacritics. PDF is fundamentally a
> visual format, not a structured format like HTML, which is why
> extracting text can be so difficult sometimes.
>
> The full PDF Reference document can be found at:
>
> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>
> Feel free to discuss specifics of your proposal or ask any questions.
>
> Thanks,
>
> -- John
>
> On 23 Feb 2014, at 21:13, DImuthu Upeksha <[email protected]>
> wrote:
>
> > Hi,
> > I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
> > University of Moratuwa, Sri Lanka. I successfully completed GSoC 2013
> > with the Apache ISIS [1] project. I'm very interested in OCR and image
> > processing, so I would like to select this project idea as my GSoC 2014
> > project because I feel it is the best-suited project for me. At
> > university we have also done some research in the OCR area, and our
> > group wrote a literature review on increasing the efficiency of OCR
> > systems (attached). Could you please suggest where I should start
> > learning about PDFBox?
> >
> > [1] http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
> >
> > Thank you
> > Dimuthu
> >
> > --
> > Regards
> > W.Dimuthu Upeksha
> > Undergraduate
> > Department of Computer Science And Engineering
> > University of Moratuwa, Sri Lanka
>
>


-- 
Regards

W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka
