Re: Improving OCR plugin for PDFBox

John Hewson Thu, 26 Jun 2014 23:59:23 -0700

Hi Dimuthu

That’s great. We should wait until closer to the end of the GSoC period to 
integrate your work with PDFBox, as ideally we only want to have to do it once. 
We’ve not included C++ dependencies before so no, there won’t be a standard 
way; we’ll have to think something up. One option is to make it an optional 
sub-project, and the Tesseract JNI bindings might be better off having their 
own branch so that they are more like an external dependency - I’ll ask the 
dev mailing list.

To prepare your code for contribution you’ll need to add the Apache header to 
each .java file (see any PDFBox .java file for an example) and submit a signed 
ICLA (http://www.apache.org/licenses/icla.pdf) to Apache.
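
For reference, this is the standard ASF source header as it would sit at the 
top of a file. The class below is only a made-up placeholder to show where the 
header goes; its name is not from your code or from PDFBox.

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
// In a real source file the package declaration and imports follow the header.

/** Hypothetical placeholder class, present only to show the header position. */
final class ExampleOcrClass {
    static final String NAME = "example";
}
```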

Regarding additional functionality, the most useful addition would be a new 
command line tool which writes the OCR’d text back into the original PDF file 
as “invisible text”, so that copy and paste and text search then work for that 
PDF file. A starting point for this would be to try to write the OCR’d text 
into the original PDF as “visible” text - we can make it invisible later!
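
As a rough sketch of the idea at the PDF level (this is not PDFBox API; the 
class and method names are made up, and the font resource name "F1" is an 
assumption): visible and invisible text differ only in the text rendering 
mode operator, Tr, where mode 0 fills the glyphs and mode 3 neither fills nor 
strokes them. The helper below only builds the content-stream operators for 
one line of ASCII text. Appending them to a page, registering the font, and 
encoding non-ASCII text are left out.

```java
import java.util.Locale;

/** Sketch only: builds raw PDF content-stream operators for one line of text. */
final class OcrTextLayerSketch {

    /** Text rendering mode 0: fill glyphs (normal, visible text). */
    static final int VISIBLE = 0;
    /** Text rendering mode 3: neither fill nor stroke (invisible text). */
    static final int INVISIBLE = 3;

    /** Escapes the characters that are special inside a PDF literal string. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c == '(' || c == ')' || c == '\\') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /**
     * Returns a text object that selects the font, sets the rendering mode,
     * positions the text at (x, y) in page coordinates, and shows the string.
     */
    static String textOperators(String fontName, float fontSize,
                                float x, float y, String text, int mode) {
        return String.format(Locale.ROOT,
                "BT\n/%s %.2f Tf\n%d Tr\n1 0 0 1 %.2f %.2f Tm\n(%s) Tj\nET\n",
                fontName, fontSize, mode, x, y, escape(text));
    }

    public static void main(String[] args) {
        // First pass: visible text, so the placement can be checked by eye.
        System.out.print(textOperators("F1", 12f, 72f, 700f, "Hello (OCR)", VISIBLE));
        // Later: the same operators with mode 3 give the invisible text layer.
        System.out.print(textOperators("F1", 12f, 72f, 700f, "Hello (OCR)", INVISIBLE));
    }
}
```

Keeping the rendering mode as a parameter means the "visible first, invisible 
later" switch is a one-argument change once the positions look right.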

-- John

On 19 Jun 2014, at 13:57, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:

> Hi John,
> Apart from providing compatibility for platforms like Windows, I think most 
> of the functionality of the OCR plugin is finished (please correct me if I'm 
> wrong). But I would like to contribute to the project further. Do you have 
> anything to add as new functionality? And if you plan to add this to the 
> PDFBox code, how should I prepare my code? Is there a standard way?
> 
> Thanks
> Dimuthu
> -- 
> Regards
> W.Dimuthu Upeksha
> Undergraduate
> Department of Computer Science And Engineering
> University of Moratuwa, Sri Lanka
