Re: [GSoC 2014]Optical Character Recognition project - Introduction

DImuthu Upeksha Wed, 26 Feb 2014 16:54:12 -0800

Sorry for the mistake. I added it to my Dropbox [1].

[1]
https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf


Thanks
Dimuthu


On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <j...@jahewson.com> wrote:

> I should add that the OCR engine should be pluggable so PDFToText might
> use an interface, e.g. OCREngine and there will be a TesseractOCREngine
> class somewhere which provides the required functionality and lives in a
> separate jar file.
>
> -- John
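A minimal sketch of the pluggable design described above, using only the JDK. Only the names OCREngine and TesseractOCREngine come from this thread; the recognize() method, FixedTextOCREngine, and ImageTextExtractor are hypothetical names for illustration and are not PDFBox API.

```java
import java.awt.image.BufferedImage;

/** Hypothetical plug-in point: the text extractor depends only on this interface. */
interface OCREngine {
    /** Returns the text recognised in the given image region (may be empty). */
    String recognize(BufferedImage region);
}

/** Stand-in engine for testing; a TesseractOCREngine would live in its own jar. */
final class FixedTextOCREngine implements OCREngine {
    private final String text;

    FixedTextOCREngine(String text) {
        this.text = text;
    }

    @Override
    public String recognize(BufferedImage region) {
        return text;
    }
}

/** Consumer side (assumed shape): it never names a concrete engine. */
final class ImageTextExtractor {
    private final OCREngine engine;

    ImageTextExtractor(OCREngine engine) {
        this.engine = engine;
    }

    String extract(BufferedImage region) {
        // Delegate to whichever engine was plugged in, then tidy whitespace.
        return engine.recognize(region).trim();
    }
}
```

Because the extractor only sees the interface, the Tesseract-backed class can ship in a separate jar. java.util.ServiceLoader is one standard JDK way to discover such an implementation at runtime, though nothing in this thread decides on it.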
>
> > On 25 Feb 2014, at 20:18, Dimuthu <dimuthu.upeks...@gmail.com> wrote:
> >
> > So do you need to embed the new functionality into the existing
> PDFToText algorithms, or package it as a new subsystem (something like an
> API)?
> >
> > -----Original Message-----
> > From: "John Hewson" <j...@jahewson.com>
> > Sent: 26/02/2014 07:38
> > To: "dev@pdfbox.apache.org" <dev@pdfbox.apache.org>
> > Subject: Re: [GSoC 2014]Optical Character Recognition project -
> Introduction
> >
> > Yes, exactly. By location data I just mean (x,y) coordinates and page
> rotation.
> >
> > There is another use case for OCR: some fonts embedded in PDFs have
> corrupt encodings, which means the ASCII codes map to the wrong glyphs. We
> could OCR the glyphs to repair the encoding.
> >
> > -- John
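To make the encoding-repair idea concrete, here is a rough sketch using only the JDK. It assumes we can render the glyph for each character code and ask an OCR engine what it looks like; that step is represented by the hypothetical ocrGlyphForCode callback. EncodingRepair, rebuild, and decode are illustrative names, not PDFBox API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

final class EncodingRepair {
    private EncodingRepair() {
    }

    /** Builds a code -> text table from what OCR sees when each code's glyph is drawn. */
    static Map<Integer, String> rebuild(int[] codes, IntFunction<String> ocrGlyphForCode) {
        Map<Integer, String> table = new HashMap<>();
        for (int code : codes) {
            String seen = ocrGlyphForCode.apply(code);
            // Skip codes the OCR step could not recognise.
            if (seen != null && !seen.isEmpty()) {
                table.put(code, seen);
            }
        }
        return table;
    }

    /** Decodes character codes with the repaired table; unknown codes become U+FFFD. */
    static String decode(int[] codes, Map<Integer, String> table) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(table.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }
}
```

The point is that OCR runs once per distinct glyph rather than once per page, and the repaired table is then reused for every occurrence of that code.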
> >
> >> On 25 Feb 2014, at 17:13, DImuthu Upeksha <dimuthu.upeks...@gmail.com>
> wrote:
> >>
> >> Hi John,
> >> Thanks for the explanation.
> >> Let's say there is a PDF with both text in extractable format and some
> >> images containing text (scanned images). In that case we first extract the
> >> extractable content using the PDFBox algorithms, and the rest is extracted
> >> using OCR. Finally we combine both results and return them as the
> >> PDFToText output. Am I correct? What do you mean by "location data"?
> >>
> >> Thanks
> >> Dimuthu
> >>
> >>
> >>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <j...@jahewson.com>
> wrote:
> >>>
> >>>> 1. What are "glyphs"?
> >>>
> >>> http://en.wikipedia.org/wiki/Glyph
> >>>
> >>>> 2. What is the main requirement of this project?
> >>>> As far as I understand, we first need to generate an image of a
> >>>> malformed PDF with PDFBox and then process it with OCR to get more
> >>>> accurate results. But why shouldn't we run OCR directly on those
> >>>> PDFs, without getting output from PDFBox first? Correct me if I'm wrong.
> >>>
> >>> PDFBox can generate images (PDFToImage) and can extract text
> (PDFToText).
> >>> The goal of
> >>> this project is to enhance PDFToText so that it can use OCR to extract
> >>> text from areas of the
> >>> document where the text is embedded as an image. Such PDF files are
> >>> typically generated by
> >>> scanners or fax machines. There is also another case where OCR is
> useful:
> >>> some fonts embedded
> >>> in PDF files contain the wrong encoding, so when text is extracted with
> >>> PDFToText the result is
> >>> nonsense but when drawn with PDFToImage we see the correct letters.
> >>>
> >>> Instead of:
> >>> PDF => Image => OCR => Text
> >>>
> >>> We want to do:
> >>> PDF => (Many images for words + location data => OCR) => Text
> >>>
> >>> -- John
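The second pipeline above can be sketched roughly as follows: text that PDFBox extracts directly and text that comes back from OCR both carry (x, y) positions, and the two sets are merged into reading order. This is only an illustration under stated assumptions: LocatedText and ReadingOrderMerger are hypothetical names, coordinates are assumed to be already corrected for page rotation with y increasing down the page, and lineTolerance is an invented parameter.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** A piece of text plus where it sits on the page (hypothetical type). */
final class LocatedText {
    final String text;
    final double x;
    final double y;

    LocatedText(String text, double x, double y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class ReadingOrderMerger {
    private ReadingOrderMerger() {
    }

    /** Merges directly extracted words and OCR'd words into lines, top to bottom. */
    static String merge(List<LocatedText> extracted, List<LocatedText> ocr, double lineTolerance) {
        List<LocatedText> all = new ArrayList<>(extracted);
        all.addAll(ocr);
        all.sort(Comparator.comparingDouble((LocatedText t) -> t.y));

        StringBuilder out = new StringBuilder();
        List<LocatedText> line = new ArrayList<>();
        for (LocatedText t : all) {
            // Start a new line once we move further down than the tolerance allows.
            if (!line.isEmpty() && t.y - line.get(0).y > lineTolerance) {
                appendLine(out, line);
                line.clear();
            }
            line.add(t);
        }
        appendLine(out, line);
        return out.toString();
    }

    private static void appendLine(StringBuilder out, List<LocatedText> line) {
        if (line.isEmpty()) {
            return;
        }
        // Within a line, order words left to right.
        line.sort(Comparator.comparingDouble((LocatedText t) -> t.x));
        if (out.length() > 0) {
            out.append('\n');
        }
        for (int i = 0; i < line.size(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(line.get(i).text);
        }
    }
}
```

Without the location data, an OCR'd word from a scanned region could not be placed next to the extracted text it belongs with, which is why the whole-page "PDF => Image => OCR => Text" route is not enough here.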
> >>>
> >>>>
> >>>>
> >>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
> >>> dimuthu.upeks...@gmail.com
> >>>>> wrote:
> >>>>
> >>>>> Ok fixed. This is what I did
> >>>>> Right click on the new project ->Debug As-> Debug Configurations
> >>> ->Source
> >>>>> ->Add -> Project
> >>>>> Then I selected PDFBox project.
> >>>>>
> >>>>> Thanks
> >>>>> Dimuthu
> >>>>>
> >>>>>
> >>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
> >>>>> dimuthu.upeks...@gmail.com> wrote:
> >>>>>
> >>>>>> I'm using Eclipse. This is what I want. I created a new Java
> >>>>>> application project (say TestPDFBox) with a main class containing the
> >>>>>> following code.
> >>>>>>
> >>>>>> PDDocument document = new PDDocument();
> >>>>>> PDPage blankPage = new PDPage();
> >>>>>> document.addPage(blankPage);
> >>>>>> document.save("BlankPage.pdf");
> >>>>>> document.close();
> >>>>>>
> >>>>>> Then I need to add the jar files generated in the target folder of PDFBox
> >>>>>> to the build path of my new project (I built the PDFBox project from
> >>>>>> source). That is what I did. But let's say I need to check the
> >>>>>> functionality of the document.save("") method. I don't have a reference
> >>>>>> to its sources because I used the generated jars directly. As Tilman
> >>>>>> said, I built PDFBox from sources, but I don't know a proper way to use
> >>>>>> it in other projects other than adding those jar files to the build path.
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <j...@jahewson.com>
> >>> wrote:
> >>>>>>
> >>>>>>> Which IDE are you using? You should be able to run the PDFToText
> class
> >>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as the
> >>> command
> >>>>>>> line argument.
> >>>>>>>
> >>>>>>> -- John
> >>>>>>>
> >>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
> >>> dimuthu.upeks...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Hi John,
> >>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and managed to
> >>>>>>>> build it successfully. I looked at the classes you mentioned and got a
> >>>>>>>> rough idea of how they work. To check them I used the jars in the target
> >>>>>>>> folder in a separate Java project. I tried the samples at
> >>>>>>>> http://pdfbox.apache.org/cookbook/. I need to look further into the code,
> >>>>>>>> especially how the processXXX() methods work in the PDFTextStripper class.
> >>>>>>>> What I usually do is add some breakpoints and check them in the debug
> >>>>>>>> windows. But with jars that is not possible. What approach do you follow
> >>>>>>>> for such a task?
> >>>>>>>>
> >>>>>>>> I also installed Tesseract on my machine and managed to do some OCR
> >>>>>>>> with it. It's a cool tool that works fine.
> >>>>>>>> I'm still learning the code. If I run into any issues I'll drop you a
> >>>>>>>> mail.
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> Dimuthu
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <j...@jahewson.com
> >
> >>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Dimuthu
> >>>>>>>>>
> >>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/. It
> >>>>>>>>> contains a basic overview of the project
> >>>>>>>>> and details on how to obtain the source code and build PDFBox for
> >>>>>>>>> yourself.
> >>>>>>>>>
> >>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details the
> only
> >>>>>>>>> thoughts so far regarding it.
> >>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue are all
> >>> under
> >>>>>>> the
> >>>>>>>>> Apache license, which is a
> >>>>>>>>> requirement.
> >>>>>>>>>
> >>>>>>>>> Once you have the source code, take a look at the PageDrawer
> class
> >>> to
> >>>>>>> see
> >>>>>>>>> how text and images are
> >>>>>>>>> rendered. We want someone to interface at a low-level (e.g. one
> >>> glyph,
> >>>>>>>>> word, or sentence at a time) with
> >>>>>>>>> an OCR engine. Also look at PDFTextStripper, which is how text is
> >>>>>>>>> currently extracted. Note how we have to go to great lengths to sort
> >>>>>>>>> text back into reading order and infer the placement of diacritics. PDF
> >>>>>>>>> is fundamentally a visual format, not a structured format like HTML,
> >>>>>>>>> which is why extracting text can be so difficult sometimes.
> >>>>>>>>>
> >>>>>>>>> The full PDF Reference document can be found at:
> >>>
> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
> >>>>>>>>>
> >>>>>>>>> Feel free to discuss specifics of your proposal or ask any
> >>> questions.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> -- John
> >>>>>>>>>
> >>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
> >>> dimuthu.upeks...@gmail.com
> >>>>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
> >>>>>>>>>> University of Moratuwa, Sri Lanka. I successfully completed GSoC 2013
> >>>>>>>>>> with the Apache ISIS [1] project. I'm very interested in OCR and image
> >>>>>>>>>> processing, so I would like to select this project idea as my GSoC 2014
> >>>>>>>>>> project, because I feel it is the best-suited project for me. At
> >>>>>>>>>> university we have also done some research in the OCR area, and our
> >>>>>>>>>> group wrote a literature review about increasing the efficiency of OCR
> >>>>>>>>>> systems (attached). Can you please suggest where I should start
> >>>>>>>>>> learning about PDFBox?
> >>>>>>>>>>
> >>>>>>>>>> [1]
> >>>
> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
> >>>>>>>>>>
> >>>>>>>>>> Thank you
> >>>>>>>>>> Dimuthu
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Regards
> >>>>>>>>>> W.Dimuthu Upeksha
> >>>>>>>>>> Undergraduate
> >>>>>>>>>> Department of Computer Science And Engineering
> >>>>>>>>>> University of Moratuwa, Sri Lanka
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Regards
> >>>>>>>>
> >>>>>>>> W.Dimuthu Upeksha
> >>>>>>>> Undergraduate
> >>>>>>>> Department of Computer Science And Engineering
> >>>>>>>>
> >>>>>>>> University of Moratuwa, Sri Lanka
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Regards
> >>>>>>
> >>>>>> W.Dimuthu Upeksha
> >>>>>> Undergraduate
> >>>>>> Department of Computer Science And Engineering
> >>>>>>
> >>>>>> University of Moratuwa, Sri Lanka
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Regards
> >>>>>
> >>>>> W.Dimuthu Upeksha
> >>>>> Undergraduate
> >>>>> Department of Computer Science And Engineering
> >>>>>
> >>>>> University of Moratuwa, Sri Lanka
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Regards
> >>>>
> >>>> W.Dimuthu Upeksha
> >>>> Undergraduate
> >>>> Department of Computer Science And Engineering
> >>>>
> >>>> University of Moratuwa, Sri Lanka
> >>
> >>
> >> --
> >> Regards
> >>
> >> W.Dimuthu Upeksha
> >> Undergraduate
> >> Department of Computer Science And Engineering
> >>
> >> University of Moratuwa, Sri Lanka
>



-- 
Regards

W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka
