Hi John,

Thanks for the reply. Yes, I checked out the PDFBox code and managed to build it successfully. I looked at the classes you mentioned and got a rough idea of how they work. To check them, I added the jars from the target folder to a separate Java project and tried the samples in http://pdfbox.apache.org/cookbook/.

I need to look further into the code, especially how the processXXX() methods work in the PDFTextStripper class. What I usually do is add some breakpoints and inspect the state in the debug windows, but that is not possible when using the jars. What approach do you follow for this kind of task?

I also installed Tesseract on my machine and managed to do some OCR with it. It's a cool tool that works well. I'm still learning the code. If I run into any issues, I'll drop you a mail.

Thanks,
Dimuthu

On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <[email protected]> wrote:

> Hi Dimuthu
>
> The PDFBox website can be found at http://pdfbox.apache.org/. It contains
> a basic overview of the project and details on how to obtain the source
> code and build PDFBox for yourself.
>
> Currently we do not perform any OCR and PDFBOX-1912 details the only
> thoughts so far regarding it. Note that the OCR libraries mentioned in
> the JIRA issue are all under the Apache license, which is a requirement.
>
> Once you have the source code, take a look at the PageDrawer class to see
> how text and images are rendered. We want someone to interface at a
> low-level (e.g. one glyph, word, or sentence at a time) with an OCR
> engine. Also look at PDFTextStripper which is how text is currently
> extracted, take a look at how we have to go to great lengths to sort text
> back into reading order and infer the placement of diacritics - PDF is
> fundamentally a visual format, not a structured format like HTML - which
> is why extracting text can be so difficult sometimes.
>
> The full PDF Reference document can be found at:
>
> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>
> Feel free to discuss specifics of your proposal or ask any questions.
>
> Thanks,
>
> -- John
>
> On 23 Feb 2014, at 21:13, DImuthu Upeksha <[email protected]> wrote:
>
> > Hi,
> > I am Dimuthu Upeksha, a Computer Engineering Undergraduate at University
> > of Moratuwa, Sri Lanka. I successfully completed my GSoC 2013 with the
> > Apache ISIS [1] project. I'm very much interested in OCR and image
> > processing stuff. So I would like to select this project idea as my
> > GSoC 2014 project because I feel like it is the best suited project for
> > me. In university also we have done some research in the OCR area and
> > our group wrote a literature review about increasing efficiency of OCR
> > systems (attached). Can you please suggest where to start learning
> > about PDFBox?
> >
> > [1] http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
> >
> > Thank you
> > Dimuthu
> >
> > --
> > Regards
> > W.Dimuthu Upeksha
> > Undergraduate
> > Department of Computer Science And Engineering
> > University of Moratuwa, Sri Lanka
>
>

--
Regards
W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering
University of Moratuwa, Sri Lanka
