Re: [GSoC 2014]Optical Character Recognition project - Introduction

John Hewson Tue, 25 Feb 2014 18:09:12 -0800

Yes, exactly. By location data I just mean (x,y) coordinates and page rotation.
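
As a rough illustration of how both results could be packed together once every fragment carries location data, here is a minimal Java sketch. Nothing in it is PDFBox API: LocatedText, merge and the fromOcr flag are names made up for this example. It assumes PDF's default coordinate system (origin at the bottom-left of the page, y increasing upwards) and carries the page rotation along without applying it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class LocatedTextMerge {

    /** A piece of text plus its location data (hypothetical type, not part of PDFBox). */
    static final class LocatedText {
        final String text;
        final float x;          // left edge, in page units
        final float y;          // baseline; origin assumed at the bottom-left of the page
        final int rotation;     // page rotation in degrees (0, 90, 180 or 270); not applied here
        final boolean fromOcr;  // true if the text came from OCR rather than normal extraction

        LocatedText(String text, float x, float y, int rotation, boolean fromOcr) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.rotation = rotation;
            this.fromOcr = fromOcr;
        }
    }

    /** Combines extracted and OCR'd fragments and joins them in a simple reading order. */
    static String merge(List<LocatedText> extracted, List<LocatedText> recognised) {
        List<LocatedText> all = new ArrayList<>(extracted);
        all.addAll(recognised);
        // Top of the page first (larger y), then left to right (smaller x).
        all.sort(Comparator.comparingDouble((LocatedText t) -> t.y)
                .reversed()
                .thenComparingDouble(t -> t.x));
        StringBuilder out = new StringBuilder();
        for (LocatedText t : all) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(t.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<LocatedText> extracted = Arrays.asList(
                new LocatedText("Hello", 10f, 700f, 0, false),
                new LocatedText("footer", 10f, 50f, 0, false));
        List<LocatedText> recognised = Arrays.asList(
                new LocatedText("world", 60f, 700f, 0, true));
        System.out.println(merge(extracted, recognised)); // prints "Hello world footer"
    }
}
```

A real implementation would also have to group fragments into lines with some tolerance on y and honour the rotation, which this sketch deliberately leaves out.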


There is another use case for OCR: some fonts embedded in PDFs have corrupt 
encodings, which means the ASCII codes map to the wrong glyphs. We could OCR 
the glyphs to repair the encoding.
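
A minimal sketch of that repair idea, under stated assumptions: EncodingRepair, buildMap and decode are hypothetical names, and the OCR step is passed in as a plain function from character code to recognised text. That function stands in for "render the glyph for this code and run an OCR engine on it"; no real PDFBox or OCR library call is shown.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

public class EncodingRepair {

    /**
     * Builds a corrected code-to-text map by asking the supplied recogniser
     * what each code's glyph actually looks like.
     */
    static Map<Integer, String> buildMap(int[] codes, IntFunction<String> recogniseGlyph) {
        Map<Integer, String> map = new LinkedHashMap<>();
        for (int code : codes) {
            String seen = recogniseGlyph.apply(code);
            if (seen != null && !seen.isEmpty()) {
                map.put(code, seen); // keep only codes the recogniser could read
            }
        }
        return map;
    }

    /** Decodes character codes with the repaired map; unknown codes become "?". */
    static String decode(int[] codes, Map<Integer, String> repaired) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            out.append(repaired.getOrDefault(code, "?"));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for a broken font: codes 65, 66, 67 ("ABC" if taken as ASCII)
        // are assumed to draw the glyphs P, D and F.
        Map<Integer, String> drawn = new HashMap<>();
        drawn.put(65, "P");
        drawn.put(66, "D");
        drawn.put(67, "F");
        IntFunction<String> fakeOcr = code -> drawn.get(code);

        int[] codes = {65, 66, 67};
        Map<Integer, String> repaired = buildMap(codes, fakeOcr);
        System.out.println(decode(codes, repaired)); // prints "PDF"
    }
}
```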

-- John

> On 25 Feb 2014, at 17:13, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
> 
> Hi John,
> Thanks for the explanation.
> Let's say there is a PDF with both text in an extractable format and some
> images containing text (scanned images). In that case we first extract the
> extractable content using PDFBox algorithms and the rest is extracted using
> OCR. Finally we pack both results together and give the output as PDFToText.
> Am I correct? What do you mean by "location data"?
> 
> Thanks
> Dimuthu
> 
> 
>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <j...@jahewson.com> wrote:
>> 
>> 1. What is called "glyphs" ?
>> 
>> http://en.wikipedia.org/wiki/Glyph
>> 
>>> 2. What is the main requirement of this project?
>>> As far as I understood, first we need to generate an image of
>>> malformed pdfs from
>>> PDFBox and then we need to do processing using OCR for further accurate
>>> results.  But the problem is, why shouldn't we directly do OCR on those
>>> PDFs without getting output from PDFBox? Correct me if I'm wrong.
>> 
>> PDFBox can generate images (PDFToImage) and can extract text (PDFToText).
>> The goal of this project is to enhance PDFToText so that it can use OCR to
>> extract text from areas of the document where the text is embedded as an
>> image. Such PDF files are typically generated by scanners or fax machines.
>> There is also another case where OCR is useful: some fonts embedded in PDF
>> files contain the wrong encoding, so when text is extracted with PDFToText
>> the result is nonsense but when drawn with PDFToImage we see the correct
>> letters.
>> 
>> Instead of:
>> PDF => Image => OCR => Text
>> 
>> We want to do:
>> PDF => (Many images for words + location data => OCR) => Text
>> 
>> -- John
>> 
>>> 
>>> 
>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha
>>> <dimuthu.upeks...@gmail.com> wrote:
>>> 
>>>> OK, fixed. This is what I did:
>>>> Right-click on the new project -> Debug As -> Debug Configurations ->
>>>> Source -> Add -> Project
>>>> Then I selected the PDFBox project.
>>>> 
>>>> Thanks
>>>> Dimuthu
>>>> 
>>>> 
>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>> dimuthu.upeks...@gmail.com> wrote:
>>>> 
>>>>> I'm using Eclipse. This is what I want. I created a new Java application
>>>>> project (say TestPDFBox) with a main class with the following code.
>>>>> 
>>>>> PDDocument document = new PDDocument();
>>>>> PDPage blankPage = new PDPage();
>>>>> document.addPage(blankPage);
>>>>> document.save("BlankPage.pdf");
>>>>> document.close();
>>>>> 
>>>>> Then I need to add the jar files generated in the target folder of PDFBox
>>>>> to the build path of my new project (I built the PDFBox project from
>>>>> source). That is what I did. But let's say I need to check the
>>>>> functionality of the document.save("") method. I don't have a reference
>>>>> to its sources because I used the generated jars directly. As Tilman
>>>>> said, I built PDFBox from source, but I don't know a proper way to use
>>>>> it in other projects other than adding those jar files to the build path.
>>>>> 
>>>>> 
>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <j...@jahewson.com> wrote:
>>>>> 
>>>>>> Which IDE are you using? You should be able to run the PDFToText class
>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as the command
>>>>>> line argument.
>>>>>> 
>>>>>> -- John
>>>>>> 
>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <dimuthu.upeks...@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> Hi John,
>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and managed to
>>>>>>> build it successfully. I looked at the classes you mentioned and got a
>>>>>>> rough idea of how they work. To check them I added the jars from the
>>>>>>> target folder to a separate Java project. I tried the samples in
>>>>>>> http://pdfbox.apache.org/cookbook/. I need to look further into the code,
>>>>>>> especially how those processXXX() methods work in the PDFTextStripper
>>>>>>> class. What I usually do is add some breakpoints and inspect them in the
>>>>>>> debug windows, but that's not possible using jars. How do you go about
>>>>>>> such a task?
>>>>>>> 
>>>>>>> I also installed tesseract on my machine and managed to do some OCR
>>>>>>> with it. It's a cool tool which works fine.
>>>>>>> I'm still learning the code. If I run into any issues I'll drop you a mail.
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Dimuthu
>>>>>>> 
>>>>>>> 
>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <j...@jahewson.com> wrote:
>>>>>>>> 
>>>>>>>> Hi Dimuthu
>>>>>>>> 
>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/. It contains
>>>>>>>> a basic overview of the project and details on how to obtain the source
>>>>>>>> code and build PDFBox for yourself.
>>>>>>>> 
>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details the only
>>>>>>>> thoughts so far regarding it. Note that the OCR libraries mentioned in
>>>>>>>> the JIRA issue are all under the Apache license, which is a requirement.
>>>>>>>> 
>>>>>>>> Once you have the source code, take a look at the PageDrawer class to see
>>>>>>>> how text and images are rendered. We want someone to interface at a
>>>>>>>> low level (e.g. one glyph, word, or sentence at a time) with an OCR
>>>>>>>> engine. Also look at PDFTextStripper, which is how text is currently
>>>>>>>> extracted. Take a look at how we have to go to great lengths to sort
>>>>>>>> text back into reading order and infer the placement of diacritics. PDF
>>>>>>>> is fundamentally a visual format, not a structured format like HTML,
>>>>>>>> which is why extracting text can be so difficult sometimes.
>>>>>>>> 
>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>> 
>>>>>>>> Feel free to discuss specifics of your proposal or ask any questions.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> -- John
>>>>>>>> 
>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <dimuthu.upeks...@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
>>>>>>>>> University of Moratuwa, Sri Lanka. I successfully completed GSoC 2013
>>>>>>>>> with the Apache ISIS [1] project. I'm very interested in OCR and image
>>>>>>>>> processing, so I would like to select this project idea as my GSoC 2014
>>>>>>>>> project because I feel it is the best-suited project for me. At
>>>>>>>>> university we have also done some research in the OCR area, and our
>>>>>>>>> group wrote a literature review about increasing the efficiency of OCR
>>>>>>>>> systems (attached). Can you please suggest where to start learning
>>>>>>>>> about PDFBox?
>>>>>>>>> 
>>>>>>>>> [1]
>>>>>>>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>> 
>>>>>>>>> Thank you
>>>>>>>>> Dimuthu
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Regards
>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>> Undergraduate
>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>> University of Moratuwa, Sri Lanka
