Which IDE are you using? You should be able to run the PDFToText class (in 
pdfbox-tools) using your IDE and pass a PDF file path as the command line 
argument.

-- John

> On 24 Feb 2014, at 22:38, DImuthu Upeksha <[email protected]> wrote:
> 
> Hi John,
> Thanks for the reply. Yes, I checked out the PDFBox code and managed to build
> it successfully. I looked at the classes you mentioned and got a rough
> idea of how they work. To try them out, I added the jars from the target
> folder to a separate Java project and ran the samples from
> http://pdfbox.apache.org/cookbook/. I need to look further into the code,
> especially how the processXXX() methods in the PDFTextStripper class work.
> What I usually do is add some breakpoints and inspect them in the debug
> windows, but that is not possible when using the jars. What approach do you
> follow for this kind of task?
> 
> I also installed Tesseract on my machine and managed to do some OCR
> with it. It's a cool tool that works well.
> I'm still learning the code. If I run into any issues, I'll drop you a mail.
> 
> Thanks
> Dimuthu
> 
> 
>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <[email protected]> wrote:
>> 
>> Hi Dimuthu
>> 
>> The PDFBox website can be found at http://pdfbox.apache.org/. It contains
>> a basic overview of the project
>> and details on how to obtain the source code and build PDFBox for yourself.
>> 
>> Currently we do not perform any OCR and PDFBOX-1912 details the only
>> thoughts so far regarding it.
>> Note that the OCR libraries mentioned in the JIRA issue are all under the
>> Apache license, which is a
>> requirement.
>> 
>> Once you have the source code, take a look at the PageDrawer class to see
>> how text and images are
>> rendered. We want someone to interface at a low-level (e.g. one glyph,
>> word, or sentence at a time) with
>> an OCR engine. Also look at PDFTextStripper, which is how text is currently
>> extracted. Take a look at how
>> we have to go to great lengths to sort text back into reading order and
>> infer the placement of diacritics - PDF
>> is fundamentally a visual format, not a structured format like HTML -
>> which is why extracting text can be so
>> difficult sometimes.
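
To illustrate the reading-order problem described above, here is a minimal, self-contained sketch. It is not how PDFTextStripper is implemented; the `Fragment` class, the `toReadingOrder` method, and the baseline tolerance are hypothetical names and values chosen only for this example. It assumes each piece of text arrives with a position, in arbitrary drawing order, and that fragments whose baselines are within a small tolerance belong to the same line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ReadingOrderSketch {

    /** Hypothetical stand-in for a positioned piece of text; not a PDFBox class. */
    public static final class Fragment {
        final String text;
        final double x; // left edge of the fragment
        final double y; // baseline, measured downwards from the top of the page

        public Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups fragments into lines by baseline, then orders each line left to right.
     * lineTolerance is the largest baseline difference still treated as the same line.
     */
    public static String toReadingOrder(List<Fragment> fragments, double lineTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // Sort top to bottom first so fragments on the same line become neighbours.
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        StringBuilder out = new StringBuilder();
        List<Fragment> line = new ArrayList<>();
        for (Fragment f : sorted) {
            // A baseline jump larger than the tolerance starts a new line.
            if (!line.isEmpty() && Math.abs(f.y - line.get(0).y) > lineTolerance) {
                appendLine(out, line);
                line.clear();
            }
            line.add(f);
        }
        if (!line.isEmpty()) {
            appendLine(out, line);
        }
        return out.toString();
    }

    /** Sorts one line left to right and appends it, separating fragments with spaces. */
    private static void appendLine(StringBuilder out, List<Fragment> line) {
        line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        if (out.length() > 0) {
            out.append('\n');
        }
        for (int i = 0; i < line.size(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(line.get(i).text);
        }
    }

    public static void main(String[] args) {
        // Fragments listed in drawing order, which differs from reading order.
        List<Fragment> page = Arrays.asList(
                new Fragment("world", 60, 10.2),
                new Fragment("Second", 10, 30),
                new Fragment("Hello", 10, 10),
                new Fragment("line", 55, 29.8));
        System.out.println(toReadingOrder(page, 2.0));
    }
}
```

A real page also has columns, rotated text, right-to-left scripts, and diacritics drawn as separate glyphs, none of which this sketch handles. That gap is a fair indication of why the actual extraction code is so much more involved.
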
>> 
>> The full PDF Reference document can be found at:
>> 
>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>> 
>> Feel free to discuss specifics of your proposal or ask any questions.
>> 
>> Thanks,
>> 
>> -- John
>> 
>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <[email protected]>
>> wrote:
>> 
>>> Hi,
>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the University
>>> of Moratuwa, Sri Lanka. I successfully completed GSoC 2013 with the Apache
>>> ISIS [1] project. I'm very interested in OCR and image processing,
>>> so I would like to select this project idea as my GSoC 2014 project
>>> because I feel it is the best-suited project for me. At university
>>> we have also done some research in the OCR area, and our group wrote a
>>> literature review about increasing the efficiency of OCR systems (attached). Can
>>> you please suggest where I should start learning about PDFBox?
>>> 
>>> [1]
>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>> 
>>> Thank you
>>> Dimuthu
>>> 
>>> --
>>> Regards
>>> W.Dimuthu Upeksha
>>> Undergraduate
>>> Department of Computer Science And Engineering
>>> University of Moratuwa, Sri Lanka
> 
> 
> -- 
> Regards
> 
> W.Dimuthu Upeksha
> Undergraduate
> Department of Computer Science And Engineering
> 
> University of Moratuwa, Sri Lanka
