Hi Dimuthu,

The PDFBox website is at http://pdfbox.apache.org/. It contains a basic
overview of the project and details on how to obtain the source code and
build PDFBox yourself.

Currently we do not perform any OCR, and PDFBOX-1912 records the only
thoughts so far on the subject. Note that the OCR libraries mentioned in that
JIRA issue are all under the Apache license, which is a requirement.

Once you have the source code, take a look at the PageDrawer class to see how
text and images are rendered. We want someone to interface with an OCR engine
at a low level (e.g. one glyph, word, or sentence at a time). Also look at
PDFTextStripper, which is how text is currently extracted, and note the great
lengths we have to go to in order to sort text back into reading order and
infer the placement of diacritics. PDF is fundamentally a visual format, not
a structured format like HTML, which is why extracting text can be so
difficult.
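
To give a feel for the reading-order problem, here is a minimal Java sketch.
It is not how PDFTextStripper works internally and it uses no PDFBox classes;
Fragment, toReadingOrder, and lineTolerance are hypothetical names made up for
this illustration. It assumes each piece of text arrives with an (x, y)
position, with y growing downward, in whatever order the content stream
happened to draw it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ReadingOrderSketch {

    /** Hypothetical positioned piece of text; not a PDFBox type. */
    static final class Fragment {
        final String text;
        final float x;
        final float y;

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups fragments into lines by vertical position, then orders each
     * line left to right. Fragments whose y differs from the first fragment
     * of the current line by at most lineTolerance share that line.
     */
    static String toReadingOrder(List<Fragment> fragments, float lineTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // Top to bottom first (assumption: y grows downward in this sketch).
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> lines = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(f.y - last.get(0).y) <= lineTolerance) {
                last.add(f);
            } else {
                List<Fragment> line = new ArrayList<>();
                line.add(f);
                lines.add(line);
            }
        }

        StringBuilder out = new StringBuilder();
        for (List<Fragment> line : lines) {
            // Left to right within a line.
            line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            for (int i = 0; i < line.size(); i++) {
                if (i > 0) {
                    out.append(' ');
                }
                out.append(line.get(i).text);
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Fragments listed in drawing order, which is not reading order.
        List<Fragment> fragments = Arrays.asList(
                new Fragment("world", 60f, 10f),
                new Fragment("Hello", 10f, 10.5f),
                new Fragment("line", 50f, 30f),
                new Fragment("Second", 10f, 30f));
        System.out.print(toReadingOrder(fragments, 2f));
    }
}
```

Even this toy version needs a tolerance to decide what counts as "the same
line"; real documents add rotated text, multiple columns, and diacritics
drawn as separate glyphs, which is where the complexity comes from.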

The full PDF Reference document can be found at:
http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf

Feel free to discuss specifics of your proposal or ask any questions.

Thanks,

-- John

On 23 Feb 2014, at 21:13, Dimuthu Upeksha <[email protected]> wrote:

> Hi,
> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
> University of Moratuwa, Sri Lanka. I successfully completed GSoC 2013 with
> the Apache ISIS [1] project. I am very interested in OCR and image
> processing, so I would like to select this project idea as my GSoC 2014
> project because I feel it is the project best suited to me. At university we
> have also done some research in the OCR area, and our group wrote a
> literature review about increasing the efficiency of OCR systems (attached).
> Could you please suggest where to start learning about PDFBox?
> 
> [1]http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
> 
> Thank you
> Dimuthu
> 
> -- 
> Regards
> W.Dimuthu Upeksha
> Undergraduate
> Department of Computer Science And Engineering
> University of Moratuwa, Sri Lanka
