Hi Dimuthu,

The PDFBox website can be found at http://pdfbox.apache.org/. It contains a basic overview of the project and details on how to obtain the source code and build PDFBox for yourself.

Currently we do not perform any OCR, and PDFBOX-1912 records the only thoughts on it so far. Note that the OCR libraries mentioned in the JIRA issue are all under the Apache license, which is a requirement.

Once you have the source code, take a look at the PageDrawer class to see how text and images are rendered. We want someone to interface with an OCR engine at a low level (e.g. one glyph, word, or sentence at a time).

Also look at PDFTextStripper, which is how text is currently extracted. Notice the great lengths we have to go to in order to sort text back into reading order and infer the placement of diacritics. PDF is fundamentally a visual format, not a structured format like HTML, which is why extracting text can be so difficult.

The full PDF Reference document can be found at:
http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf

Feel free to discuss specifics of your proposal or ask any questions.

Thanks,

-- John

On 23 Feb 2014, at 21:13, DImuthu Upeksha <[email protected]> wrote:

> Hi,
> I am Dimuthu Upeksha, a Computer Engineering Undergraduate at University of
> Moratuwa Sri Lanka. I successfully completed my GSoC 2013 with Apache ISIS
> [1] project. I'm very much interested in OCR and image processing stuff. So I
> would like to select this project idea as my GSoC 2014 project because I feel
> like it is the best suited project for me. In university also we have done
> some research in OCR area and our group wrote a literature review about
> increasing efficiency of OCR systems(attached). Can you please suggest me
> where to start learning about PDFBox?
>
> [1]http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>
> Thank you
> Dimuthu
>
> --
> Regards
> W.Dimuthu Upeksha
> Undergraduate
> Department of Computer Science And Engineering
> University of Moratuwa, Sri Lanka
