Sorry for the mistake. I added it to my Dropbox [1].

[1] https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf

Thanks
Dimuthu

On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <j...@jahewson.com> wrote:

> I should add that the OCR engine should be pluggable, so PDFToText might
> use an interface, e.g. OCREngine, and there will be a TesseractOCREngine
> class somewhere which provides the required functionality and lives in a
> separate jar file.
>
> -- John
>
>
> On 25 Feb 2014, at 20:18, Dimuthu <dimuthu.upeks...@gmail.com> wrote:
> >
> > So do you need to embed those new functionalities into the existing
> > PDFToText algorithms or package them as a new subsystem (something like
> > an API)?
> >
> > -----Original Message-----
> > From: "John Hewson" <j...@jahewson.com>
> > Sent: 26/02/2014 07:38
> > To: "dev@pdfbox.apache.org" <dev@pdfbox.apache.org>
> > Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction
> >
> > Yes, exactly. By location data I just mean (x,y) coordinates and page
> > rotation.
> >
> > There is another use case for OCR: some fonts embedded in PDFs have
> > corrupt encodings, which means the ASCII codes map to the wrong glyphs.
> > We could OCR the glyphs to repair the encoding.
> >
> > -- John
> >
> >> On 25 Feb 2014, at 17:13, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
> >>
> >> Hi John,
> >> Thanks for the explanation.
> >> Let's say there is a PDF with both text in extractable format and some
> >> images with text (scanned images). In that case we first extract the
> >> extractable content using PDFBox algorithms and the rest is extracted
> >> using OCR. Finally we pack both results together and give the output as
> >> PDFToText. Am I correct? What do you mean by "location data"?
> >>
> >> Thanks
> >> Dimuthu
> >>
> >>
> >>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <j...@jahewson.com> wrote:
> >>>
> >>> 1. What is called "glyphs"?
> >>>
> >>> http://en.wikipedia.org/wiki/Glyph
> >>>
> >>>> 2. What is the main requirement of this project?
> >>>> As far as I understood, first we need to generate an image of
> >>>> malformed pdfs from PDFBox and then we need to do processing using OCR
> >>>> for further accurate results. But the problem is, why shouldn't we
> >>>> directly do OCR on those PDFs without getting output from PDFBox?
> >>>> Correct me if I'm wrong.
> >>>
> >>> PDFBox can generate images (PDFToImage) and can extract text (PDFToText).
> >>> The goal of this project is to enhance PDFToText so that it can use OCR
> >>> to extract text from areas of the document where the text is embedded as
> >>> an image. Such PDF files are typically generated by scanners or fax
> >>> machines. There is also another case where OCR is useful: some fonts
> >>> embedded in PDF files contain the wrong encoding, so when text is
> >>> extracted with PDFToText the result is nonsense but when drawn with
> >>> PDFToImage we see the correct letters.
> >>>
> >>> Instead of:
> >>> PDF => Image => OCR => Text
> >>>
> >>> We want to do:
> >>> PDF => (Many images for words + location data => OCR) => Text
> >>>
> >>> -- John
> >>>
> >>>>
> >>>>
> >>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
> >>>>
> >>>>> Ok fixed. This is what I did
> >>>>> Right click on the new project -> Debug As -> Debug Configurations
> >>>>> -> Source -> Add -> Project
> >>>>> Then I selected PDFBox project.
> >>>>>
> >>>>> Thanks
> >>>>> Dimuthu
> >>>>>
> >>>>>
> >>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
> >>>>>
> >>>>>> I'm using eclipse. This is what I want. I created a new Java
> >>>>>> application project (say TestPDFBox) with a main class with
> >>>>>> following code.
> >>>>>>
> >>>>>> PDDocument document = new PDDocument();
> >>>>>> PDPage blankPage = new PDPage();
> >>>>>> document.addPage(blankPage);
> >>>>>> document.save("BlankPage.pdf");
> >>>>>> document.close();
> >>>>>>
> >>>>>> Then I need to add those jar files generated in the target folder of
> >>>>>> PDFBox to the build path of my new project (I did build the PDFBox
> >>>>>> project from source). That is what I did. But let's say I need to
> >>>>>> check the functionality of the document.save("") method. I don't have
> >>>>>> a reference to its sources because I directly used the generated
> >>>>>> jars. As Tilman said, I built PDFBox from sources, but I don't know a
> >>>>>> proper way to use it in other projects other than adding those jar
> >>>>>> files to the build path.
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <j...@jahewson.com> wrote:
> >>>>>>
> >>>>>>> Which IDE are you using? You should be able to run the PDFToText
> >>>>>>> class (in pdfbox-tools) using your IDE and pass a PDF file path as
> >>>>>>> the command line argument.
> >>>>>>>
> >>>>>>> -- John
> >>>>>>>
> >>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi John,
> >>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and managed
> >>>>>>>> to build it successfully. I looked at the classes you mentioned and
> >>>>>>>> I got a rough idea about how they are working. To check them I added
> >>>>>>>> the jars in the target folder to my separate Java project. I tried
> >>>>>>>> the samples in http://pdfbox.apache.org/cookbook/. I need to look
> >>>>>>>> further into the code, especially how those processXXX() methods
> >>>>>>>> work in the PDFTextStripper class. What I usually do is add some
> >>>>>>>> breakpoints and check them in the debug windows. But using jars
> >>>>>>>> it's not possible. What is the way you follow in order to do such
> >>>>>>>> a task?
> >>>>>>>>
> >>>>>>>> I also installed Tesseract on my machine and managed to do some OCR
> >>>>>>>> stuff as well. That's a cool tool which works fine.
> >>>>>>>> I'm still learning the code. If I get any issue I'll drop you a mail.
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> Dimuthu
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <j...@jahewson.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Dimuthu
> >>>>>>>>>
> >>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/; it
> >>>>>>>>> contains a basic overview of the project and details on how to
> >>>>>>>>> obtain the source code and build PDFBox for yourself.
> >>>>>>>>>
> >>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details the
> >>>>>>>>> only thoughts so far regarding it. Note that the OCR libraries
> >>>>>>>>> mentioned in the JIRA issue are all under the Apache license,
> >>>>>>>>> which is a requirement.
> >>>>>>>>>
> >>>>>>>>> Once you have the source code, take a look at the PageDrawer class
> >>>>>>>>> to see how text and images are rendered. We want someone to
> >>>>>>>>> interface at a low level (e.g. one glyph, word, or sentence at a
> >>>>>>>>> time) with an OCR engine. Also look at PDFTextStripper, which is
> >>>>>>>>> how text is currently extracted, and take a look at how we have to
> >>>>>>>>> go to great lengths to sort text back into reading order and infer
> >>>>>>>>> the placement of diacritics. PDF is fundamentally a visual format,
> >>>>>>>>> not a structured format like HTML, which is why extracting text
> >>>>>>>>> can be so difficult sometimes.
> >>>>>>>>>
> >>>>>>>>> The full PDF Reference document can be found at:
> >>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
> >>>>>>>>>
> >>>>>>>>> Feel free to discuss specifics of your proposal or ask any
> >>>>>>>>> questions.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> -- John
> >>>>>>>>>
> >>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering Undergraduate at
> >>>>>>>>>> University of Moratuwa Sri Lanka. I successfully completed my
> >>>>>>>>>> GSoC 2013 with Apache ISIS [1] project. I'm very much interested
> >>>>>>>>>> in OCR and image processing stuff. So I would like to select this
> >>>>>>>>>> project idea as my GSoC 2014 project because I feel like it is
> >>>>>>>>>> the best suited project for me. In university also we have done
> >>>>>>>>>> some research in OCR area and our group wrote a literature review
> >>>>>>>>>> about increasing efficiency of OCR systems (attached). Can you
> >>>>>>>>>> please suggest me where to start learning about PDFBox?
> >>>>>>>>>>
> >>>>>>>>>> [1] http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
> >>>>>>>>>>
> >>>>>>>>>> Thank you
> >>>>>>>>>> Dimuthu
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Regards
> >>>>>>>>>> W.Dimuthu Upeksha
> >>>>>>>>>> Undergraduate
> >>>>>>>>>> Department of Computer Science And Engineering
> >>>>>>>>>> University of Moratuwa, Sri Lanka
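
For reference, the pluggable-engine idea from this thread (PDFToText talking to an OCREngine interface, with a TesseractOCREngine living in a separate jar) could be sketched roughly as below. This is only an illustration under stated assumptions, not PDFBox code: the method name recognize, the Region and PositionedText types, the StubOCREngine, and the simple top-to-bottom, left-to-right ordering are all hypothetical. The stub stands in for a real Tesseract binding so that the sketch runs without Tesseract installed.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class OcrSketch {

    /** Hypothetical plug-in point; a TesseractOCREngine would implement it in its own jar. */
    interface OCREngine {
        String recognize(BufferedImage wordImage);
    }

    /** An image region plus the "location data" from the thread: (x, y) and page rotation. */
    static final class Region {
        final BufferedImage image;
        final float x;
        final float y;
        final int rotation; // carried along but not used by this sketch

        Region(BufferedImage image, float x, float y, int rotation) {
            this.image = image;
            this.x = x;
            this.y = y;
            this.rotation = rotation;
        }
    }

    /** A piece of text and where it sits on the page. */
    static final class PositionedText {
        final String text;
        final float x;
        final float y;

        PositionedText(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Stand-in engine (assumption): returns a fake word derived from the image width. */
    static final class StubOCREngine implements OCREngine {
        @Override
        public String recognize(BufferedImage wordImage) {
            return "w" + wordImage.getWidth();
        }
    }

    /**
     * Merges already-extractable text with OCR results for image regions,
     * then orders everything by position (assumes y grows downward).
     */
    static String extract(List<PositionedText> embeddedText,
                          List<Region> imageRegions,
                          OCREngine engine) {
        List<PositionedText> all = new ArrayList<>(embeddedText);
        for (Region r : imageRegions) {
            all.add(new PositionedText(engine.recognize(r.image), r.x, r.y));
        }
        all.sort(Comparator.comparingDouble((PositionedText p) -> p.y)
                           .thenComparingDouble(p -> p.x));
        StringBuilder out = new StringBuilder();
        for (PositionedText p : all) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(p.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<PositionedText> embedded = Arrays.asList(
                new PositionedText("Hello", 10f, 100f),
                new PositionedText("world", 120f, 100f));
        BufferedImage scannedWord = new BufferedImage(40, 12, BufferedImage.TYPE_BYTE_GRAY);
        List<Region> regions = Arrays.asList(new Region(scannedWord, 60f, 100f, 0));
        System.out.println(extract(embedded, regions, new StubOCREngine())); // prints "Hello w40 world"
    }
}
```

Keeping the engine behind an interface means the text extraction side never needs to know which OCR library is on the classpath; swapping the stub for a Tesseract-backed class would not change extract(). Real reading-order recovery is much harder than this sort, as the discussion of PDFTextStripper above points out.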