Re: [GSoC 2014]Optical Character Recognition project - Introduction

John Hewson Fri, 28 Feb 2014 10:58:33 -0800

This is a good start. However, there is no need for the Adder component; 
the “Extracted Text (OCR)” can just feed back into the PDFBox “Text Extractor”.

Maybe show a “PDF” file feeding into the “Text Extractor”, to make it clear where 
the process starts.

-- John

On 26 Feb 2014, at 16:53, DImuthu Upeksha <[email protected]> wrote:

> Sorry for the mistake. I added it to my Dropbox [1].
> 
> [1]
> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
> 
> Thanks
> Dimuthu
> 
> 
> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <[email protected]> wrote:
> 
>> I should add that the OCR engine should be pluggable so PDFToText might
>> use an interface, e.g. OCREngine and there will be a TesseractOCREngine
>> class somewhere which provides the required functionality and lives in a
>> separate jar file.
>> 
>> -- John
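
The pluggable design described above might look roughly like the following minimal sketch. Only the names OCREngine and TesseractOCREngine come from the thread; the recognize method, the FixedTextOCREngine stand-in, and the TextExtractorSketch class are assumptions made for illustration and are not PDFBox API.

```java
import java.awt.image.BufferedImage;
import java.util.Objects;

/** Hypothetical plug-in point; the thread only suggests the name OCREngine. */
interface OCREngine {
    /** Returns the text recognized in the given image (may be empty). */
    String recognize(BufferedImage image);
}

/**
 * Stand-in engine used here for illustration. A real TesseractOCREngine
 * would implement the same interface and live in a separate jar.
 */
final class FixedTextOCREngine implements OCREngine {
    private final String text;

    FixedTextOCREngine(String text) {
        this.text = Objects.requireNonNull(text);
    }

    @Override
    public String recognize(BufferedImage image) {
        return text;
    }
}

/** Hypothetical consumer standing in for PDFToText; it sees only the interface. */
final class TextExtractorSketch {
    private final OCREngine engine;

    TextExtractorSketch(OCREngine engine) {
        this.engine = Objects.requireNonNull(engine);
    }

    /** Combines already-extracted text with OCR output for a scanned region. */
    String extract(String embeddedText, BufferedImage scannedRegion) {
        String ocrText = scannedRegion == null ? "" : engine.recognize(scannedRegion);
        if (embeddedText.isEmpty()) {
            return ocrText;
        }
        if (ocrText.isEmpty()) {
            return embeddedText;
        }
        return embeddedText + "\n" + ocrText;
    }
}
```

Because the extractor depends only on the interface, swapping in a different engine would not require changes to the extraction code.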
>> 
>>> On 25 Feb 2014, at 20:18, Dimuthu <[email protected]> wrote:
>>> 
>>> So do you need to embed the new functionality into the existing
>> PDFToText algorithms or package it as a new subsystem (something like an
>> API)?
>>> 
>>> -----Original Message-----
>>> From: "John Hewson" <[email protected]>
>>> Sent: 26/02/2014 07:38
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>> Introduction
>>> 
>>> Yes, exactly. By location data I just mean (x,y) coordinates and page
>> rotation.
>>> 
>>> There is another use case for OCR: some fonts embedded in PDFs have
>> corrupt encodings, which means the ASCII codes map to the wrong glyphs. We
>> could OCR the glyphs to repair the encoding.
>>> 
>>> -- John
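
The encoding-repair idea above could be sketched as a lookup table built from OCR results. This is only an illustration under assumptions: the EncodingRepairSketch class and its learn and decode methods are hypothetical, and how glyphs would actually be rendered and passed to an OCR engine is not covered.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical repair table: character code -> text that OCR read for its glyph. */
final class EncodingRepairSketch {
    private final Map<Integer, String> codeToText = new HashMap<>();

    /** Records what OCR recognized for the glyph drawn for this character code. */
    void learn(int code, String ocrResult) {
        codeToText.put(code, ocrResult);
    }

    /** Decodes codes using the repaired table; unknown codes become U+FFFD. */
    String decode(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(codeToText.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }
}
```

For example, if a font's broken encoding uses code 65 for a glyph that looks like "H", learning that pair once lets every later occurrence of code 65 decode to "H".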
>>> 
>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <[email protected]>
>> wrote:
>>>> 
>>>> Hi John,
>>>> Thanks for the explanation.
>>>> Let's say there is a PDF with both text in an extractable format and some
>>>> images containing text (scanned images). In that case, we first extract the
>>>> extractable content using the PDFBox algorithms, and the rest is extracted
>>>> using OCR. Finally, we combine both results and return them as the output
>>>> of PDFToText. Am I correct? What do you mean by "location data"?
>>>> 
>>>> Thanks
>>>> Dimuthu
>>>> 
>>>> 
>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <[email protected]>
>> wrote:
>>>>> 
>>>>>> 1. What are "glyphs"?
>>>>> 
>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>> 
>>>>>> 2. What is the main requirement of this project?
>>>>>> As far as I understood, first we need to generate an image of
>>>>>> malformed pdfs from
>>>>>> PDFBox and then we need to do processing using OCR for further
>> accurate
>>>>>> results.  But the problem is, why shouldn't we directly do OCR on
>> those
>>>>>> PDFs without getting output from PDFBox? Correct me if I'm wrong.
>>>>> 
>>>>> PDFBox can generate images (PDFToImage) and can extract text
>> (PDFToText).
>>>>> The goal of
>>>>> this project is to enhance PDFToText so that it can use OCR to extract
>>>>> text from areas of the
>>>>> document where the text is embedded as an image. Such PDF files are
>>>>> typically generated by
>>>>> scanners or fax machines. There is also another case where OCR is
>> useful:
>>>>> some fonts embedded
>>>>> in PDF files contain the wrong encoding, so when text is extracted with
>>>>> PDFToText the result is
>>>>> nonsense but when drawn with PDFToImage we see the correct letters.
>>>>> 
>>>>> Instead of:
>>>>> PDF => Image => OCR => Text
>>>>> 
>>>>> We want to do:
>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>> 
>>>>> -- John
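
One way to picture the "many images + location data" pipeline above is to attach coordinates to each piece of text and merge embedded and OCR results by position. The sketch below is a simplification under assumptions: PositionedText and ReadingOrderMerger are hypothetical names, page rotation is ignored, and fragments are ordered by exact y and then x, which is far cruder than the reading-order logic in PDFTextStripper.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical holder for a piece of text plus its location data. */
final class PositionedText {
    final String text;
    final float x;          // left edge, in page units
    final float y;          // distance from the top of the page
    final boolean fromOcr;  // true if the text came from an OCR engine

    PositionedText(String text, float x, float y, boolean fromOcr) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.fromOcr = fromOcr;
    }
}

final class ReadingOrderMerger {
    /** Merges embedded and OCR fragments top-to-bottom, then left-to-right. */
    static String merge(List<PositionedText> embedded, List<PositionedText> ocr) {
        List<PositionedText> all = new ArrayList<>(embedded);
        all.addAll(ocr);
        all.sort(Comparator.comparingDouble((PositionedText p) -> p.y)
                .thenComparingDouble(p -> p.x));

        StringBuilder out = new StringBuilder();
        for (PositionedText p : all) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(p.text);
        }
        return out.toString();
    }
}
```

Keeping the coordinates with each OCR result is what lets scanned words be placed back among the normally extracted text instead of being appended at the end.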
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>>>>> [email protected]
>>>>>>> wrote:
>>>>>> 
>>>>>>> Ok fixed. This is what I did
>>>>>>> Right click on the new project ->Debug As-> Debug Configurations
>>>>> ->Source
>>>>>>> ->Add -> Project
>>>>>>> Then I selected PDFBox project.
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Dimuthu
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>> [email protected]> wrote:
>>>>>>> 
>>>>>>>> I'm using eclipse. This is what I want. I created a new Java
>>>>> application
>>>>>>>> project (say TestPDFBox) with a main class with following code.
>>>>>>>> 
>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>> document.addPage( blankPage );
>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>> document.close();
>>>>>>>> 
>>>>>>>> Then I need to add the jar files generated in the target folder of
>>>>>>>> PDFBox to the build path of my new project (I built the PDFBox project
>>>>>>>> from source). That is what I did. But let's say I need to check the
>>>>>>>> functionality of the document.save("") method: I don't have a reference
>>>>>>>> to its sources because I used the generated jars directly. As Tilman
>>>>>>>> said, I built PDFBox from source, but I don't know a proper way to use
>>>>>>>> it in other projects apart from adding those jar files to the build path.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <[email protected]>
>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Which IDE are you using? You should be able to run the PDFToText
>> class
>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as the
>>>>> command
>>>>>>>>> line argument.
>>>>>>>>> 
>>>>>>>>> -- John
>>>>>>>>> 
>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>>>>> [email protected]>
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi John,
>>>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and managed to
>>>>>>>>>> build it successfully. I looked at the classes you mentioned and got a
>>>>>>>>>> rough idea of how they work. To check them, I added the jars from the
>>>>>>>>>> target folder to a separate Java project and tried the samples at
>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to look further into the code,
>>>>>>>>>> especially how the processXXX() methods work in the PDFTextStripper class.
>>>>>>>>>> What I usually do is add some breakpoints and inspect them in the debug
>>>>>>>>>> windows, but that is not possible when using jars. What approach do you
>>>>>>>>>> follow for such a task?
>>>>>>>>>> 
>>>>>>>>>> I also installed Tesseract on my machine and managed to do some OCR
>>>>>>>>>> with it. It's a cool tool that works well.
>>>>>>>>>> I'm still learning the code. If I run into any issues I'll drop you a
>>>>>>>>>> mail.
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> Dimuthu
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <[email protected]
>>> 
>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>> 
>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/. It
>>>>>>>>>>> contains a basic overview of the project and details on how to
>>>>>>>>>>> obtain the source code and build PDFBox for yourself.
>>>>>>>>>>> 
>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details the
>> only
>>>>>>>>>>> thoughts so far regarding it.
>>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue are all
>>>>> under
>>>>>>>>> the
>>>>>>>>>>> Apache license, which is a
>>>>>>>>>>> requirement.
>>>>>>>>>>> 
>>>>>>>>>>> Once you have the source code, take a look at the PageDrawer class
>>>>>>>>>>> to see how text and images are rendered. We want someone to
>>>>>>>>>>> interface at a low level (e.g. one glyph, word, or sentence at a
>>>>>>>>>>> time) with an OCR engine. Also look at PDFTextStripper, which is how
>>>>>>>>>>> text is currently extracted. Take a look at how we have to go to
>>>>>>>>>>> great lengths to sort text back into reading order and infer the
>>>>>>>>>>> placement of diacritics. PDF is fundamentally a visual format, not a
>>>>>>>>>>> structured format like HTML, which is why extracting text can be so
>>>>>>>>>>> difficult sometimes.
>>>>>>>>>>> 
>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>> 
>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>> 
>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
>>>>> questions.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks,
>>>>>>>>>>> 
>>>>>>>>>>> -- John
>>>>>>>>>>> 
>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>>>>> [email protected]
>>>>>>>>>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
>>>>>>>>>>>> University of Moratuwa, Sri Lanka. I successfully completed GSoC 2013
>>>>>>>>>>>> with the Apache ISIS [1] project. I'm very interested in OCR and image
>>>>>>>>>>>> processing, so I would like to select this project idea as my GSoC 2014
>>>>>>>>>>>> project because I feel it is the best-suited project for me. At
>>>>>>>>>>>> university we have also done some research in the OCR area, and our
>>>>>>>>>>>> group wrote a literature review on increasing the efficiency of OCR
>>>>>>>>>>>> systems (attached). Could you please suggest where to start learning
>>>>>>>>>>>> about PDFBox?
>>>>>>>>>>>> 
>>>>>>>>>>>> [1]
>>>>> 
>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>> 
>>>>>>>>>>>> Thank you
>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> Regards
>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> Regards
>>>>>>>>>> 
>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>> Undergraduate
>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>> 
>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Regards
>>>>>>>> 
>>>>>>>> W.Dimuthu Upeksha
>>>>>>>> Undergraduate
>>>>>>>> Department of Computer Science And Engineering
>>>>>>>> 
>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Regards
>>>>>>> 
>>>>>>> W.Dimuthu Upeksha
>>>>>>> Undergraduate
>>>>>>> Department of Computer Science And Engineering
>>>>>>> 
>>>>>>> University of Moratuwa, Sri Lanka
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Regards
>>>>>> 
>>>>>> W.Dimuthu Upeksha
>>>>>> Undergraduate
>>>>>> Department of Computer Science And Engineering
>>>>>> 
>>>>>> University of Moratuwa, Sri Lanka
>>>> 
>>>> 
>>>> --
>>>> Regards
>>>> 
>>>> W.Dimuthu Upeksha
>>>> Undergraduate
>>>> Department of Computer Science And Engineering
>>>> 
>>>> University of Moratuwa, Sri Lanka
>> 
> 
> 
> 
> -- 
> Regards
> 
> W.Dimuthu Upeksha
> Undergraduate
> Department of Computer Science And Engineering
> 
> University of Moratuwa, Sri Lanka
