Hi Dimuthu,

That’s great. We should wait until closer to the end of the GSoC period to integrate your work with PDFBox, as ideally we only want to have to do it once. We’ve not included C++ dependencies before, so no, there won’t be a standard way; we’ll have to think something up. We could make it an optional sub-project, and the Tesseract JNI bindings might be better off having their own branch so that they are more like an external dependency. I’ll ask the dev mailing list.

To prepare your code for contribution you’ll need to add the Apache header to each .java file (see any PDFBox .java file for an example) and submit a signed ICLA (http://www.apache.org/licenses/icla.pdf) to Apache.

Regarding additional functionality, the most useful would be a new command line tool which could write the OCR’d text back into the original PDF file as “invisible text”, which would allow copy and paste and text search to work for that PDF file. A starting point for this would be to try to write the OCR’d text into the original PDF as “visible” text - we can make it invisible later!

-- John

On 19 Jun 2014, at 13:57, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:

> Hi John,
> Except providing compatibility for platforms like windows, I think most of
> the functionalities of OCR plugin are finished (Please correct me if I'm
> wrong). But I would like to contribute to project further. Do you have
> anything to add as a new functionality? And If you plan to add this to PDFBox
> code, how should prepare my code? Is there any standard way?
>
> Thanks
> Dimuthu
> --
> Regards
> W.Dimuthu Upeksha
> Undergraduate
> Department of Computer Science And Engineering
> University of Moratuwa, Sri Lanka