Re: [GSoC 2014]Optical Character Recognition project - Introduction

John Hewson Wed, 26 Feb 2014 13:07:30 -0800

Dimuthu

Our mailing list doesn’t support attachments (and fails silently when they are
used), so can you please post the file somewhere public and send a hyperlink to
the mailing list?


-- John

On 26 Feb 2014, at 07:36, DImuthu Upeksha <[email protected]> wrote:

> Hi John,
> I have attached a top-level architecture diagram for this project as far as I 
> have understood it. Please have a look. It may not be perfect, but I want to 
> make sure that I'm on the right track with requirements gathering. I have 
> used an OCR plugin to connect to Tesseract instead of calling it directly 
> because that lets us connect another OCR library in the future without extra 
> effort. The Adder is responsible for binding extracted text together 
> according to location data. Waiting for your comments.
> 
> Thanks
> Dimuthu
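
The plugin idea described above (reaching Tesseract through an OCR plugin so that another library can be swapped in later) can be sketched as a small Java interface plus a registry. This is only an illustration: OcrEngine, FixedTextEngine, and OcrEngineRegistry are hypothetical names, not PDFBox or Tesseract APIs, and the stand-in engine returns canned text instead of doing real recognition.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical plugin contract; nothing here comes from PDFBox or Tesseract.
interface OcrEngine {
    /** Recognizes the text in one image region, given as raw image bytes. */
    String recognize(byte[] imageBytes);
}

// Stand-in engine so the sketch runs without any OCR library installed.
class FixedTextEngine implements OcrEngine {
    private final String text;

    FixedTextEngine(String text) {
        this.text = text;
    }

    @Override
    public String recognize(byte[] imageBytes) {
        return text; // a real plugin would call into its OCR library here
    }
}

// Looks engines up by name, so the extraction code never names a concrete library.
class OcrEngineRegistry {
    private final Map<String, OcrEngine> engines = new HashMap<>();

    void register(String name, OcrEngine engine) {
        engines.put(name, engine);
    }

    OcrEngine get(String name) {
        OcrEngine engine = engines.get(name);
        if (engine == null) {
            throw new IllegalArgumentException("No OCR engine registered as: " + name);
        }
        return engine;
    }
}

class OcrPluginDemo {
    public static void main(String[] args) {
        OcrEngineRegistry registry = new OcrEngineRegistry();
        registry.register("stub", new FixedTextEngine("hello"));
        System.out.println(registry.get("stub").recognize(new byte[0]));
    }
}
```

Swapping in a different OCR library would then mean writing one more OcrEngine implementation and registering it under a new name; the calling code stays unchanged.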
> 
> 
> On Wed, Feb 26, 2014 at 9:48 AM, Dimuthu <[email protected]> wrote:
> So do you need to embed the new functionality into the existing PDFToText 
> algorithms, or package it as a new subsystem (something like an API)?
> From: John Hewson
> Sent: 26/02/2014 07:38
> To: [email protected]
> Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction
> 
> Yes, exactly. By location data I just mean (x,y) coordinates and page 
> rotation.
> 
> There is another use case for OCR: some fonts embedded in PDFs have corrupt 
> encodings, which means the ASCII codes map to the wrong glyphs. We could OCR 
> the glyphs to repair the encoding.
> 
> -- John
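
A rough sketch of the encoding repair mentioned above, assuming some recognizer can tell us which character each code's glyph actually draws. EncodingRepair and the recognizeGlyph callback are hypothetical names; a real version would render each glyph and hand the image to an OCR engine, which this example replaces with a plain lookup.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

class EncodingRepair {
    // Builds a code -> character table by asking the recognizer what each code's glyph shows.
    static Map<Integer, Character> buildTable(int[] codes, IntFunction<Character> recognizeGlyph) {
        Map<Integer, Character> table = new LinkedHashMap<>();
        for (int code : codes) {
            table.put(code, recognizeGlyph.apply(code));
        }
        return table;
    }

    // Decodes character codes with the repaired table; unknown codes become U+FFFD.
    static String decode(int[] codes, Map<Integer, Character> table) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            Character c = table.get(code);
            sb.append(c != null ? c : '\uFFFD');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed broken font: codes 65, 66, 67 draw the glyphs P, D, F.
        int[] codes = {65, 66, 67};
        Map<Integer, Character> table = buildTable(codes, code -> "PDF".charAt(code - 65));
        System.out.println(decode(codes, table));
    }
}
```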
> 
> > On 25 Feb 2014, at 17:13, DImuthu Upeksha <[email protected]> wrote:
> > 
> > Hi John,
> > Thanks for the explanation.
> > Let's say there is a PDF with both text in an extractable format and some
> > images containing text (scanned images). In that case we first extract the
> > extractable content using the PDFBox algorithms, and the rest is extracted
> > using OCR. Finally we pack both results together and give the output as
> > PDFToText. Am I correct? What do you mean by "location data"?
> > 
> > Thanks
> > Dimuthu
> > 
> > 
> >> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <[email protected]> wrote:
> >> 
> >>> 1. What is called "glyphs"?
> >> 
> >> http://en.wikipedia.org/wiki/Glyph
> >> 
> >>> 2. What is the main requirement of this project?
> >>> As far as I understand, first we need to generate an image of malformed
> >>> PDFs from PDFBox and then process it with OCR for more accurate
> >>> results. But why shouldn't we run OCR directly on those
> >>> PDFs without getting output from PDFBox? Correct me if I'm wrong.
> >> 
> >> PDFBox can generate images (PDFToImage) and can extract text (PDFToText).
> >> The goal of this project is to enhance PDFToText so that it can use OCR to
> >> extract text from areas of the document where the text is embedded as an
> >> image. Such PDF files are typically generated by scanners or fax machines.
> >> There is also another case where OCR is useful: some fonts embedded in PDF
> >> files contain the wrong encoding, so when text is extracted with PDFToText
> >> the result is nonsense, but when drawn with PDFToImage we see the correct
> >> letters.
> >> 
> >> Instead of:
> >> PDF => Image => OCR => Text
> >> 
> >> We want to do:
> >> PDF => (Many images for words + location data => OCR) => Text
> >> 
> >> -- John
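
The last step of the pipeline above, putting OCR results back together using their location data, might look like the following. It is a minimal sketch under stated assumptions: TextFragment and FragmentMerger are hypothetical names, y is assumed to grow downward, page rotation is ignored, and a fragment joins the current line when its y value is within a tolerance of the first fragment on that line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical type: one recognized word plus where it sat on the page.
class TextFragment {
    final String text;
    final double x;
    final double y;

    TextFragment(String text, double x, double y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

class FragmentMerger {
    // Groups fragments into lines by y, then orders each line left to right by x.
    static String merge(List<TextFragment> fragments, double lineTolerance) {
        List<TextFragment> byY = new ArrayList<>(fragments);
        byY.sort(Comparator.comparingDouble((TextFragment f) -> f.y));

        StringBuilder out = new StringBuilder();
        List<TextFragment> line = new ArrayList<>();
        for (TextFragment f : byY) {
            // Start a new line once a fragment sits too far below the line's first fragment.
            if (!line.isEmpty() && f.y - line.get(0).y > lineTolerance) {
                appendLine(out, line);
                line.clear();
            }
            line.add(f);
        }
        if (!line.isEmpty()) {
            appendLine(out, line);
        }
        return out.toString();
    }

    private static void appendLine(StringBuilder out, List<TextFragment> line) {
        line.sort(Comparator.comparingDouble((TextFragment f) -> f.x));
        if (out.length() > 0) {
            out.append('\n');
        }
        for (int i = 0; i < line.size(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(line.get(i).text);
        }
    }

    public static void main(String[] args) {
        List<TextFragment> fragments = Arrays.asList(
                new TextFragment("world", 50, 10),
                new TextFragment("Hello", 10, 11),
                new TextFragment("again", 10, 30));
        System.out.println(merge(fragments, 2.0));
    }
}
```

Text that PDFBox extracts directly and text that comes back from OCR could both be represented as such fragments, which is one way the two sources might be combined into a single output.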
> >> 
> >>> 
> >>> 
> >>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <[email protected]> wrote:
> >>> 
> >>>> OK, fixed. This is what I did:
> >>>> Right-click on the new project -> Debug As -> Debug Configurations -> Source
> >>>> -> Add -> Project
> >>>> Then I selected the PDFBox project.
> >>>> 
> >>>> Thanks
> >>>> Dimuthu
> >>>> 
> >>>> 
> >>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <[email protected]> wrote:
> >>>> 
> >>>>> I'm using Eclipse. This is what I want. I created a new Java application
> >>>>> project (say TestPDFBox) with a main class containing the following code.
> >>>>> 
> >>>>> PDDocument document = new PDDocument();
> >>>>> PDPage blankPage = new PDPage();
> >>>>> document.addPage(blankPage);
> >>>>> document.save("BlankPage.pdf");
> >>>>> document.close();
> >>>>> 
> >>>>> Then I need to add the jar files generated in the target folder of PDFBox
> >>>>> to the build path of my new project (I did build the PDFBox project from
> >>>>> source). That is what I did. But let's say I need to check the
> >>>>> functionality of the document.save("") method. I don't have a reference to
> >>>>> its sources because I used the generated jars directly. As Tilman said, I
> >>>>> built PDFBox from source, but I don't know a proper way to use it in other
> >>>>> projects other than adding those jar files to the build path.
> >>>>> 
> >>>>> 
> >>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <[email protected]> wrote:
> >>>>> 
> >>>>>> Which IDE are you using? You should be able to run the PDFToText class
> >>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as the command
> >>>>>> line argument.
> >>>>>> 
> >>>>>> -- John
> >>>>>> 
> >>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <[email protected]> wrote:
> >>>>>>> 
> >>>>>>> Hi John,
> >>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and managed to
> >>>>>>> build it successfully. I looked at the classes you mentioned and got a
> >>>>>>> rough idea of how they work. To check them I added the jars in the target
> >>>>>>> folder to a separate Java project. I tried the samples at
> >>>>>>> http://pdfbox.apache.org/cookbook/. I need to look further into the code,
> >>>>>>> especially how those processXXX() methods work in the PDFTextStripper
> >>>>>>> class. What I usually do is add some breakpoints and check them in the
> >>>>>>> debug windows, but with jars that's not possible. What approach do you
> >>>>>>> follow for such a task?
> >>>>>>> 
> >>>>>>> I also installed Tesseract on my machine and managed to do some OCR
> >>>>>>> with it. It's a cool tool that works well.
> >>>>>>> I'm still learning the code. If I run into any issues I'll drop you a mail.
> >>>>>>> 
> >>>>>>> Thanks
> >>>>>>> Dimuthu
> >>>>>>> 
> >>>>>>> 
> >>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <[email protected]> wrote:
> >>>>>>>> 
> >>>>>>>> Hi Dimuthu
> >>>>>>>> 
> >>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/. It contains
> >>>>>>>> a basic overview of the project and details on how to obtain the source
> >>>>>>>> code and build PDFBox for yourself.
> >>>>>>>> 
> >>>>>>>> Currently we do not perform any OCR, and PDFBOX-1912 details the only
> >>>>>>>> thoughts so far regarding it. Note that the OCR libraries mentioned in the
> >>>>>>>> JIRA issue are all under the Apache license, which is a requirement.
> >>>>>>>> 
> >>>>>>>> Once you have the source code, take a look at the PageDrawer class to see
> >>>>>>>> how text and images are rendered. We want someone to interface at a low
> >>>>>>>> level (e.g. one glyph, word, or sentence at a time) with an OCR engine.
> >>>>>>>> Also look at PDFTextStripper, which is how text is currently extracted;
> >>>>>>>> take a look at how we have to go to great lengths to sort text back into
> >>>>>>>> reading order and infer the placement of diacritics. PDF is fundamentally
> >>>>>>>> a visual format, not a structured format like HTML, which is why
> >>>>>>>> extracting text can be so difficult sometimes.
> >>>>>>>> 
> >>>>>>>> The full PDF Reference document can be found at:
> >>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
> >>>>>>>> 
> >>>>>>>> Feel free to discuss specifics of your proposal or ask any questions.
> >>>>>>>> 
> >>>>>>>> Thanks,
> >>>>>>>> 
> >>>>>>>> -- John
> >>>>>>>> 
> >>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <[email protected]> wrote:
> >>>>>>>> 
> >>>>>>>>> Hi,
> >>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
> >>>>>>>>> University of Moratuwa, Sri Lanka. I successfully completed GSoC 2013 with
> >>>>>>>>> the Apache ISIS [1] project. I'm very interested in OCR and image
> >>>>>>>>> processing, so I would like to select this project idea as my GSoC 2014
> >>>>>>>>> project because I feel it is the best-suited project for me. At university
> >>>>>>>>> we have also done some research in the OCR area, and our group wrote a
> >>>>>>>>> literature review about increasing the efficiency of OCR systems
> >>>>>>>>> (attached). Can you please suggest where to start learning about PDFBox?
> >>>>>>>>> 
> >>>>>>>>> [1]
> >>>>>>>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
> >>>>>>>>> 
> >>>>>>>>> Thank you
> >>>>>>>>> Dimuthu
> >>>>>>>>> 
> >>>>>>>>> --
> >>>>>>>>> Regards
> >>>>>>>>> W.Dimuthu Upeksha
> >>>>>>>>> Undergraduate
> >>>>>>>>> Department of Computer Science And Engineering
> >>>>>>>>> University of Moratuwa, Sri Lanka
