Yes, exactly. By location data I just mean (x,y) coordinates and page rotation.
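To make "location data" concrete, here is a minimal sketch of what could travel with each word image on its way to the OCR engine. The class name WordRegion and the method topLeftPixel are hypothetical and are not part of PDFBox. The sketch assumes the default PDF user space (1/72 inch per unit, origin at the bottom-left of the page) and an image of the unrotated page with its origin at the top-left.

```java
/** Hypothetical holder for the location data of one word image; not a PDFBox class. */
final class WordRegion {
    final double x;       // left edge, in PDF user-space units (1/72 inch by default)
    final double y;       // lower edge, measured upwards from the bottom of the page
    final double width;   // region width in user-space units
    final double height;  // region height in user-space units
    final int rotation;   // page rotation in degrees, normalised to 0, 90, 180 or 270

    WordRegion(double x, double y, double width, double height, int rotation) {
        if (rotation % 90 != 0) {
            throw new IllegalArgumentException("rotation must be a multiple of 90");
        }
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
        // Map negative or large angles into the range 0..270.
        this.rotation = ((rotation % 360) + 360) % 360;
    }

    /**
     * Top-left pixel of this region in an image of the unrotated page rendered
     * at the given resolution. The y axis is flipped because the image origin
     * is assumed to be at the top-left while the page origin is at the bottom-left.
     */
    int[] topLeftPixel(double pageHeight, double dpi) {
        double scale = dpi / 72.0;
        int px = (int) Math.round(x * scale);
        int py = (int) Math.round((pageHeight - (y + height)) * scale);
        return new int[] { px, py };
    }
}
```

The rotation is only carried along here so that the caller can rotate the cropped image upright before passing it to OCR; how that is done depends on the OCR engine.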
There is another use case for OCR: some fonts embedded in PDFs have corrupt encodings, which means the ASCII codes map to the wrong glyphs. We could OCR the glyphs to repair the encoding.

-- John

> On 25 Feb 2014, at 17:13, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>
> Hi John,
> Thanks for the explanation.
> Let's say there is a PDF with both text in extractable format and some
> images with text (scanned images). In that case we first extract the
> extractable content using PDFBox algorithms and the rest is extracted using
> OCR. Finally we pack both results together and give the output as PDFToText.
> Am I correct? What do you mean by "location data"?
>
> Thanks
> Dimuthu
>
>
>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <j...@jahewson.com> wrote:
>>
>> 1. What is called "glyphs" ?
>>
>> http://en.wikipedia.org/wiki/Glyph
>>
>>> 2. What is the main requirement of this project?
>>> As far as I understood, first we need to generate an image of
>>> malformed PDFs from PDFBox and then we need to do processing using
>>> OCR for further accurate results. But the problem is, why shouldn't
>>> we directly do OCR on those PDFs without getting output from PDFBox?
>>> Correct me if I'm wrong.
>>
>> PDFBox can generate images (PDFToImage) and can extract text (PDFToText).
>> The goal of this project is to enhance PDFToText so that it can use OCR
>> to extract text from areas of the document where the text is embedded as
>> an image. Such PDF files are typically generated by scanners or fax
>> machines. There is also another case where OCR is useful: some fonts
>> embedded in PDF files contain the wrong encoding, so when text is
>> extracted with PDFToText the result is nonsense but when drawn with
>> PDFToImage we see the correct letters.
>>
>> Instead of:
>> PDF => Image => OCR => Text
>>
>> We want to do:
>> PDF => (Many images for words + location data => OCR) => Text
>>
>> -- John
>>
>>>
>>>
>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>
>>>> Ok fixed. This is what I did:
>>>> Right click on the new project -> Debug As -> Debug Configurations
>>>> -> Source -> Add -> Project
>>>> Then I selected the PDFBox project.
>>>>
>>>> Thanks
>>>> Dimuthu
>>>>
>>>>
>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>>
>>>>> I'm using Eclipse. This is what I want. I created a new Java application
>>>>> project (say TestPDFBox) with a main class with the following code.
>>>>>
>>>>> PDDocument document = new PDDocument();
>>>>> PDPage blankPage = new PDPage();
>>>>> document.addPage(blankPage);
>>>>> document.save("BlankPage.pdf");
>>>>> document.close();
>>>>>
>>>>> Then I need to add the jar files generated in the target folder of PDFBox
>>>>> to the build path of my new project (I did build the PDFBox project from
>>>>> source). That is what I did. But let's say I need to check the
>>>>> functionality of the document.save("") method. I don't have a reference to
>>>>> its sources because I directly used the generated jars. As Tilman said, I
>>>>> built PDFBox from sources, but I don't know a proper way to use it in other
>>>>> projects other than adding those jar files to the build path.
>>>>>
>>>>>
>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <j...@jahewson.com> wrote:
>>>>>
>>>>>> Which IDE are you using? You should be able to run the PDFToText class
>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as the command
>>>>>> line argument.
>>>>>>
>>>>>> -- John
>>>>>>
>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi John,
>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and managed to
>>>>>>> build the code successfully.
>>>>>>> I looked at the classes you mentioned and I got a rough
>>>>>>> idea about how they are working. To check them I added the jars from the
>>>>>>> target folder to my separate Java project. I tried the samples in
>>>>>>> http://pdfbox.apache.org/cookbook/. I need to look further into the code,
>>>>>>> especially how those processXXX() methods work in the PDFTextStripper class.
>>>>>>> What I usually do is add some breakpoints and check them in the debug
>>>>>>> windows. But using jars it's not possible. What is the way you follow in
>>>>>>> order to do such a task?
>>>>>>>
>>>>>>> I also installed Tesseract on my machine and managed to do some OCR
>>>>>>> stuff. That's a cool tool which works fine.
>>>>>>> I'm still learning the code. If I get any issue I'll drop you a mail.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Dimuthu
>>>>>>>
>>>>>>>
>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <j...@jahewson.com> wrote:
>>>>>>>>
>>>>>>>> Hi Dimuthu
>>>>>>>>
>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/ it contains
>>>>>>>> a basic overview of the project and details on how to obtain the source
>>>>>>>> code and build PDFBox for yourself.
>>>>>>>>
>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details the only
>>>>>>>> thoughts so far regarding it. Note that the OCR libraries mentioned in
>>>>>>>> the JIRA issue are all under the Apache license, which is a requirement.
>>>>>>>>
>>>>>>>> Once you have the source code, take a look at the PageDrawer class to see
>>>>>>>> how text and images are rendered. We want someone to interface at a
>>>>>>>> low-level (e.g. one glyph, word, or sentence at a time) with an OCR engine.
>>>>>>>> Also look at PDFTextStripper, which is how text is currently
>>>>>>>> extracted. Take a look at how we have to go to great lengths to sort
>>>>>>>> text back into reading order and infer the placement of diacritics - PDF
>>>>>>>> is fundamentally a visual format, not a structured format like HTML -
>>>>>>>> which is why extracting text can be so difficult sometimes.
>>>>>>>>
>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>
>>>>>>>> Feel free to discuss specifics of your proposal or ask any questions.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> -- John
>>>>>>>>
>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
>>>>>>>>> University of Moratuwa, Sri Lanka. I successfully completed my GSoC 2013
>>>>>>>>> with the Apache ISIS [1] project. I'm very much interested in OCR and
>>>>>>>>> image processing, so I would like to select this project idea as my
>>>>>>>>> GSoC 2014 project because I feel it is the best suited project for me.
>>>>>>>>> At university we have also done some research in the OCR area, and our
>>>>>>>>> group wrote a literature review about increasing the efficiency of OCR
>>>>>>>>> systems (attached). Can you please suggest where to start learning
>>>>>>>>> about PDFBox?
>>>>>>>>>
>>>>>>>>> [1] http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>
>>>>>>>>> Thank you
>>>>>>>>> Dimuthu
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Regards
>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>> Undergraduate
>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>> University of Moratuwa, Sri Lanka