This is a good start. However, there is no need for the Adder component; “Extracted Text (OCR)” can just feed back into the PDFBox “Text Extractor”.
Maybe show a “PDF” file feeding into the “Text Extractor”, to make it clear where the process starts.

-- John

On 26 Feb 2014, at 16:53, DImuthu Upeksha <[email protected]> wrote:

> Sorry for the mistake. I added it to my Dropbox [1].
>
> [1]
> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>
> Thanks
> Dimuthu
>
>
> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <[email protected]> wrote:
>
>> I should add that the OCR engine should be pluggable, so PDFToText might
>> use an interface, e.g. OCREngine, and there will be a TesseractOCREngine
>> class somewhere which provides the required functionality and lives in a
>> separate jar file.
>>
>> -- John
>>
>>> On 25 Feb 2014, at 20:18, Dimuthu <[email protected]> wrote:
>>>
>>> So do you need to embed those new functionalities into existing
>>> PDFtoText algorithms or package them as a new subsystem (something
>>> like an API)?
>>>
>>> -----Original Message-----
>>> From: "John Hewson" <[email protected]>
>>> Sent: 26/02/2014 07:38
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction
>>>
>>> Yes, exactly. By location data I just mean (x,y) coordinates and page
>>> rotation.
>>>
>>> There is another use case for OCR: some fonts embedded in PDFs have
>>> corrupt encodings, which means the ASCII codes map to the wrong glyphs.
>>> We could OCR the glyphs to repair the encoding.
>>>
>>> -- John
>>>
>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <[email protected]> wrote:
>>>>
>>>> Hi John,
>>>> Thanks for the explanation.
>>>> Let's say there is a pdf with both text in extractable format and some
>>>> images with text (scanned images). In that case first we extract the
>>>> extractable content using PDFBox algorithms and the rest is extracted
>>>> using OCR. Finally we pack both results together and give output as
>>>> PDFToText. Am I correct? What do you mean by "location data"?
>>>>
>>>> Thanks
>>>> Dimuthu
>>>>
>>>>
>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <[email protected]> wrote:
>>>>>
>>>>> 1. What is called "glyphs" ?
>>>>>
>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>
>>>>>> 2. What is the main requirement of this project?
>>>>>> As far as I understood, first we need to generate an image of
>>>>>> malformed pdfs from PDFBox and then we need to do processing using
>>>>>> OCR for further accurate results. But the problem is, why shouldn't
>>>>>> we directly do OCR on those PDFs without getting output from PDFBox?
>>>>>> Correct me if I'm wrong.
>>>>>
>>>>> PDFBox can generate images (PDFToImage) and can extract text
>>>>> (PDFToText). The goal of this project is to enhance PDFToText so that
>>>>> it can use OCR to extract text from areas of the document where the
>>>>> text is embedded as an image. Such PDF files are typically generated
>>>>> by scanners or fax machines. There is also another case where OCR is
>>>>> useful: some fonts embedded in PDF files contain the wrong encoding,
>>>>> so when text is extracted with PDFToText the result is nonsense but
>>>>> when drawn with PDFToImage we see the correct letters.
>>>>>
>>>>> Instead of:
>>>>> PDF => Image => OCR => Text
>>>>>
>>>>> We want to do:
>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>
>>>>> -- John
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <[email protected]> wrote:
>>>>>>
>>>>>>> Ok fixed. This is what I did
>>>>>>> Right click on the new project -> Debug As -> Debug Configurations
>>>>>>> -> Source -> Add -> Project
>>>>>>> Then I selected PDFBox project.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Dimuthu
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>
>>>>>>>> I'm using eclipse. This is what I want.
>>>>>>>> I created a new Java application project (say TestPDFBox) with a
>>>>>>>> main class with following code.
>>>>>>>>
>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>> document.addPage( blankPage );
>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>> document.close();
>>>>>>>>
>>>>>>>> Then I need to add those jar files generated in target folder of
>>>>>>>> PDFBox to build path of my new project (I did build the PDFBox
>>>>>>>> project from source). That is what I did. But let's say I need to
>>>>>>>> check the functionality of document.save("") method. But I don't
>>>>>>>> have a reference to its sources because I directly used generated
>>>>>>>> jars. As Tilman said I built PDFBox from sources but I don't know a
>>>>>>>> proper way to use it in other projects other than adding those jar
>>>>>>>> files to build path.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Which IDE are you using? You should be able to run the PDFToText
>>>>>>>>> class (in pdfbox-tools) using your IDE and pass a PDF file path as
>>>>>>>>> the command line argument.
>>>>>>>>>
>>>>>>>>> -- John
>>>>>>>>>
>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi John,
>>>>>>>>>> Thanks for the reply. Yes I checked out PDFBox code and managed
>>>>>>>>>> to build code successfully. I looked at the classes you mentioned
>>>>>>>>>> and I got a rough idea about how they are working. To check them
>>>>>>>>>> I used the jars in target folder to my separate java project. I
>>>>>>>>>> tried samples in http://pdfbox.apache.org/cookbook/. I need to
>>>>>>>>>> further look into code specially how those processXXX() methods
>>>>>>>>>> work in PDFTextStripper class. What I usually do is adding some
>>>>>>>>>> breakpoints and checking them in debug windows.
>>>>>>>>>> But using jars it's not possible. What is the way you follow in
>>>>>>>>>> order to do such task?
>>>>>>>>>>
>>>>>>>>>> As well I installed tesseract in to my machine and managed to do
>>>>>>>>>> some OCR stuff also. That's a cool tool which works fine.
>>>>>>>>>> I'm still learning the code. If I get any issue I'll drop you a
>>>>>>>>>> mail.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Dimuthu
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>
>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/ it
>>>>>>>>>>> contains a basic overview of the project and details on how to
>>>>>>>>>>> obtain the source code and build PDFBox for yourself.
>>>>>>>>>>>
>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details the
>>>>>>>>>>> only thoughts so far regarding it.
>>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue are all
>>>>>>>>>>> under the Apache license, which is a requirement.
>>>>>>>>>>>
>>>>>>>>>>> Once you have the source code, take a look at the PageDrawer
>>>>>>>>>>> class to see how text and images are rendered. We want someone
>>>>>>>>>>> to interface at a low-level (e.g. one glyph, word, or sentence
>>>>>>>>>>> at a time) with an OCR engine. Also look at PDFTextStripper
>>>>>>>>>>> which is how text is currently extracted, take a look at how we
>>>>>>>>>>> have to go to great length to sort text back into reading order
>>>>>>>>>>> and infer the placement of diacritics - PDF is fundamentally a
>>>>>>>>>>> visual format, not a structured format like HTML - which is why
>>>>>>>>>>> extracting text can be so difficult sometimes.
>>>>>>>>>>>
>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>
>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
>>>>>>>>>>> questions.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> -- John
>>>>>>>>>>>
>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering Undergraduate at
>>>>>>>>>>>> University of Moratuwa Sri Lanka. I successfully completed my
>>>>>>>>>>>> GSoC 2013 with Apache ISIS [1] project. I'm very much
>>>>>>>>>>>> interested in OCR and image processing stuff. So I would like
>>>>>>>>>>>> to select this project idea as my GSoC 2014 project because I
>>>>>>>>>>>> feel like it is the best suited project for me. In university
>>>>>>>>>>>> also we have done some research in OCR area and our group wrote
>>>>>>>>>>>> a literature review about increasing efficiency of OCR
>>>>>>>>>>>> systems (attached). Can you please suggest me where to start
>>>>>>>>>>>> learning about PDFBox?
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you
>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Regards
>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>> University of Moratuwa, Sri Lanka
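
For reference, the pluggable engine idea discussed in this thread could look roughly like the Java sketch below. Only the names OCREngine and TesseractOCREngine come from the thread; the recognize() signature, the ImageRegion class, StubOCREngine and OcrSketch are hypothetical names made up for illustration and are not part of PDFBox. The stub performs no real OCR. It only shows how text extracted directly and text recognised from image regions (with their (x,y) coordinates and page rotation) might be combined behind one interface.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical image region: the "location data" is (x, y) plus page rotation. */
final class ImageRegion {
    final float x;
    final float y;
    final int rotation;   // page rotation in degrees
    final byte[] pixels;  // placeholder for the rendered image data

    ImageRegion(float x, float y, int rotation, byte[] pixels) {
        this.x = x;
        this.y = y;
        this.rotation = rotation;
        this.pixels = pixels;
    }
}

/** Pluggable engine interface; this method signature is an assumption. */
interface OCREngine {
    String recognize(ImageRegion region);
}

/** Stand-in for a real engine such as a TesseractOCREngine; does no actual OCR. */
final class StubOCREngine implements OCREngine {
    @Override
    public String recognize(ImageRegion region) {
        // Returns a marker containing the region's coordinates instead of recognised text.
        return "[ocr@" + region.x + "," + region.y + "]";
    }
}

final class OcrSketch {
    /** Appends the engine's output for each image region to the directly extracted text. */
    static String extract(String extractedText, List<ImageRegion> regions, OCREngine engine) {
        StringBuilder out = new StringBuilder(extractedText);
        for (ImageRegion region : regions) {
            out.append(' ').append(engine.recognize(region));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<ImageRegion> regions = new ArrayList<>();
        regions.add(new ImageRegion(72f, 700f, 0, new byte[0]));
        System.out.println(extract("Extractable text", regions, new StubOCREngine()));
    }
}
```

Because the caller depends only on the interface, a real engine could live in a separate jar file, as suggested above, and be swapped in without changing the extraction code.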
