Dimuthu,

Our mailing list doesn’t support attachments (and fails silently when they are used), so could you please post the file somewhere public and send a link to the mailing list?
-- John

On 26 Feb 2014, at 07:36, DImuthu Upeksha <[email protected]> wrote:

> Hi John,
> I have attached a top view architecture diagram for this project as far as I
> have understood. Please have a look at it. This may not be the perfect one
> but all I need is to make sure that I'm on the correct track in requirement
> gathering. I have used an OCR plugin to connect to Tesseract instead of direct
> calling because it facilitates us to connect another OCR library in future
> without an extra effort. Adder is responsible for binding extracted text
> together according to location data. Waiting for your comments.
>
> Thanks
> Dimuthu
>
>
> On Wed, Feb 26, 2014 at 9:48 AM, Dimuthu <[email protected]> wrote:
> So do you need to embed those new functionalities into existing PDFtoText
> algorithms or package them as a new sub system (something like an API)?
> From: John Hewson
> Sent: 26/02/2014 07:38
> To: [email protected]
> Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction
>
> Yes, exactly. By location data I just mean (x,y) coordinates and page
> rotation.
>
> There is another use case for OCR: some fonts embedded in PDFs have corrupt
> encodings, which means the ASCII codes map to the wrong glyphs. We could OCR
> the glyphs to repair the encoding.
>
> -- John
>
>
> On 25 Feb 2014, at 17:13, DImuthu Upeksha <[email protected]> wrote:
> >
> > Hi John,
> > Thanks for the explanation.
> > Let's say there is a pdf with both text in extractable format and some
> > images with text (scanned images). In that case first we extract those
> > extractable content using PDFBox algorithms and rest is extracted using
> > OCR. Finally we pack both results together and give output as PDFToText. Am
> > I correct? What do you mean by "location data"?
> >
> > Thanks
> > Dimuthu
> >
> >
> >> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <[email protected]> wrote:
> >>
> >> 1. What is called "glyphs" ?
> >>
> >> http://en.wikipedia.org/wiki/Glyph
> >>
> >>> 2. What is the main requirement of this project?
> >>> As far as I understood, first we need to generate an image of
> >>> malformed pdfs from PDFBox and then we need to do processing using
> >>> OCR for further accurate results. But the problem is, why shouldn't
> >>> we directly do OCR on those PDFs without getting output from PDFBox?
> >>> Correct me if I'm wrong.
> >>
> >> PDFBox can generate images (PDFToImage) and can extract text
> >> (PDFToText). The goal of this project is to enhance PDFToText so that
> >> it can use OCR to extract text from areas of the document where the
> >> text is embedded as an image. Such PDF files are typically generated
> >> by scanners or fax machines. There is also another case where OCR is
> >> useful: some fonts embedded in PDF files contain the wrong encoding,
> >> so when text is extracted with PDFToText the result is nonsense but
> >> when drawn with PDFToImage we see the correct letters.
> >>
> >> Instead of:
> >> PDF => Image => OCR => Text
> >>
> >> We want to do:
> >> PDF => (Many images for words + location data => OCR) => Text
> >>
> >> -- John
> >>
> >>>
> >>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <[email protected]> wrote:
> >>>
> >>>> Ok fixed. This is what I did
> >>>> Right click on the new project -> Debug As -> Debug Configurations
> >>>> -> Source -> Add -> Project
> >>>> Then I selected PDFBox project.
> >>>>
> >>>> Thanks
> >>>> Dimuthu
> >>>>
> >>>>
> >>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <[email protected]> wrote:
> >>>>
> >>>>> I'm using eclipse. This is what I want. I created a new Java application
> >>>>> project (say TestPDFBox) with a main class with following code.
> >>>>>
> >>>>> PDDocument document = new PDDocument();
> >>>>> PDPage blankPage = new PDPage();
> >>>>> document.addPage(blankPage);
> >>>>> document.save("BlankPage.pdf");
> >>>>> document.close();
> >>>>>
> >>>>> Then I need to add those jar files generated in target folder of PDFBox
> >>>>> to build path of my new project (I did build the PDFBox project from
> >>>>> source). That is what I did. But let's say I need to check the
> >>>>> functionality of document.save("") method. But I don't have a reference to
> >>>>> its sources because I directly used generated jars. As Tilman said I built
> >>>>> PDFBox from sources but I don't know a proper way to use it in other
> >>>>> projects other than adding those jar files to build path.
> >>>>>
> >>>>>
> >>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <[email protected]> wrote:
> >>>>>
> >>>>>> Which IDE are you using? You should be able to run the PDFToText class
> >>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as the command
> >>>>>> line argument.
> >>>>>>
> >>>>>> -- John
> >>>>>>
> >>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Hi John,
> >>>>>>> Thanks for the reply. Yes I checked out PDFBox code and managed to build
> >>>>>>> code successfully. I looked at the classes you mentioned and I got a rough
> >>>>>>> idea about how they are working. To check them I used the jars in target
> >>>>>>> folder to my separate java project. I tried samples in
> >>>>>>> http://pdfbox.apache.org/cookbook/. I need to further look into code
> >>>>>>> specially how those processXXX() methods work in PDFTextStripper class.
> >>>>>>> What I usually do is adding some breakpoints and checking them in debug
> >>>>>>> windows. But using jars it's not possible. What is the way you follow in
> >>>>>>> order to do such task?
> >>>>>>>
> >>>>>>> As well I installed tesseract in to my machine and managed to do some OCR
> >>>>>>> stuff also. That's a cool tool which works fine.
> >>>>>>> I'm still learning the code. If I get any issue I'll drop you a mail.
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> Dimuthu
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> Hi Dimuthu
> >>>>>>>>
> >>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/ it contains
> >>>>>>>> a basic overview of the project and details on how to obtain the source
> >>>>>>>> code and build PDFBox for yourself.
> >>>>>>>>
> >>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details the only
> >>>>>>>> thoughts so far regarding it. Note that the OCR libraries mentioned in
> >>>>>>>> the JIRA issue are all under the Apache license, which is a requirement.
> >>>>>>>>
> >>>>>>>> Once you have the source code, take a look at the PageDrawer class to see
> >>>>>>>> how text and images are rendered. We want someone to interface at a
> >>>>>>>> low-level (e.g. one glyph, word, or sentence at a time) with an OCR
> >>>>>>>> engine. Also look at PDFTextStripper which is how text is currently
> >>>>>>>> extracted, take a look at how we have to go to great length to sort text
> >>>>>>>> back into reading order and infer the placement of diacritics - PDF is
> >>>>>>>> fundamentally a visual format, not a structured format like HTML - which
> >>>>>>>> is why extracting text can be so difficult sometimes.
> >>>>>>>>
> >>>>>>>> The full PDF Reference document can be found at:
> >>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
> >>>>>>>>
> >>>>>>>> Feel free to discuss specifics of your proposal or ask any questions.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> -- John
> >>>>>>>>
> >>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering Undergraduate at University
> >>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC 2013 with Apache
> >>>>>>>>> ISIS [1] project. I'm very much interested in OCR and image processing
> >>>>>>>>> stuff. So I would like to select this project idea as my GSoC 2014
> >>>>>>>>> project because I feel like it is the best suited project for me. In
> >>>>>>>>> university also we have done some research in OCR area and our group
> >>>>>>>>> wrote a literature review about increasing efficiency of OCR
> >>>>>>>>> systems(attached). Can you please suggest me where to start learning
> >>>>>>>>> about PDFBox?
> >>>>>>>>>
> >>>>>>>>> [1]
> >>>>>>>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
> >>>>>>>>>
> >>>>>>>>> Thank you
> >>>>>>>>> Dimuthu
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Regards
> >>>>>>>>> W.Dimuthu Upeksha
> >>>>>>>>> Undergraduate
> >>>>>>>>> Department of Computer Science And Engineering
> >>>>>>>>> University of Moratuwa, Sri Lanka
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Regards
> >>>>>>>
> >>>>>>> W.Dimuthu Upeksha
> >>>>>>> Undergraduate
> >>>>>>> Department of Computer Science And Engineering
> >>>>>>>
> >>>>>>> University of Moratuwa, Sri Lanka
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Regards
> >>>>>
> >>>>> W.Dimuthu Upeksha
> >>>>> Undergraduate
> >>>>> Department of Computer Science And Engineering
> >>>>>
> >>>>> University of Moratuwa, Sri Lanka
> >>>>
> >>>>
> >>>> --
> >>>> Regards
> >>>>
> >>>> W.Dimuthu Upeksha
> >>>> Undergraduate
> >>>> Department of Computer Science And Engineering
> >>>>
> >>>> University of Moratuwa, Sri Lanka
> >>>
> >>>
> >>> --
> >>> Regards
> >>>
> >>> W.Dimuthu Upeksha
> >>> Undergraduate
> >>> Department of Computer Science And Engineering
> >>>
> >>> University of Moratuwa, Sri Lanka
> >
> >
> > --
> > Regards
> >
> > W.Dimuthu Upeksha
> > Undergraduate
> > Department of Computer Science And Engineering
> >
> > University of Moratuwa, Sri Lanka
>
>
> --
> Regards
> W.Dimuthu Upeksha
> Undergraduate
> Department of Computer Science And Engineering
> University of Moratuwa, Sri Lanka
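
For readers following the thread, the design discussed above (an OCR library behind a plugin, and an "Adder" that combines extracted text and OCR output using (x,y) location data) could be sketched roughly as follows. This is only an illustration under assumptions: OcrEngine, Fragment, merge and the lineTolerance parameter are hypothetical names invented here, not PDFBox or Tesseract APIs, and the stand-in engine performs no real recognition. A top-left page origin is also assumed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class OcrMergeSketch {

    /** Hypothetical plugin interface: any OCR library could be wrapped behind it. */
    interface OcrEngine {
        String recognize(byte[] imageBytes);
    }

    /** A piece of text plus its (x,y) location on the page (assumed top-left origin). */
    static final class Fragment {
        final float x;
        final float y;
        final String text;

        Fragment(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * The "Adder" step: put fragments into reading order.
     * Fragments are sorted by y and grouped into lines; a new line starts when a
     * fragment lies more than lineTolerance below the first fragment of the
     * current line. Each line is then ordered left to right by x.
     */
    static String merge(List<Fragment> fragments, float lineTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> lines = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> current = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (current == null || f.y - current.get(0).y > lineTolerance) {
                current = new ArrayList<>();
                lines.add(current);
            }
            current.add(f);
        }

        StringBuilder out = new StringBuilder();
        for (List<Fragment> line : lines) {
            line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            if (out.length() > 0) {
                out.append('\n');
            }
            for (int i = 0; i < line.size(); i++) {
                if (i > 0) {
                    out.append(' ');
                }
                out.append(line.get(i).text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in engine for illustration only; a real plugin would call an OCR library.
        OcrEngine fakeEngine = imageBytes -> "scanned words";

        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment(200f, 100f, "text"));       // from normal text extraction
        fragments.add(new Fragment(50f, 101f, "Extracted"));   // from normal text extraction
        fragments.add(new Fragment(50f, 140f, fakeEngine.recognize(new byte[0]))); // from OCR

        System.out.println(merge(fragments, 5f));
    }
}
```

Keeping the engine behind a one-method interface is what would let another OCR library be swapped in later without touching the merge step; the grouping rule here is deliberately simple and ignores page rotation.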
