Hi John,

I noticed your last reply just after sending my previous mail. Sorry about that. I also use a Mac, with VMs to test other platforms. I have done a lot of work with Maven. I'll go through the plugin and try to apply it to that GitHub project.

Thanks
Dimuthu

On Tue, Mar 4, 2014 at 6:11 AM, DImuthu Upeksha <[email protected]> wrote:

> Hi John,
>
> I tried to reuse that Android JNI wrapper for Tesseract. Here are my
> observations:
>
> 1. This wrapper depends heavily on the Android image libraries
> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>
> 2. But I can understand the underlying logic of each function. Basically,
> it maps Tesseract API functions [2] to Java methods. In between, it does
> some image <=> byte array conversions using the Android bitmap libraries.
>
> 3. There are two ways. 1: We can port its code to make it compatible with
> our environments (Linux, Windows and Mac), which is really painful. It
> will also cause memory leaks. 2: We can use only its function signatures
> and implement them with our own code.
>
> I think the 2nd solution is better because we need only a few operations
> from the Tesseract library. I have created a GitHub repo [3] for this.
> It's still not finished. I need to add some makefiles and build files to
> make it run properly. I also need to implement those wrapper functions
> [4]. This may take some time.
>
> 4. Because we are calling native libraries, we need different builds of
> the Tesseract and Leptonica libraries for each platform (.dll for Windows,
> .so for Linux, .dylib for Mac). So we may need to build those libraries at
> the time we build the PDFBox project. Or we can pre-build those libraries
> and add them to the project in .dll, .so or .dylib format. What is the
> preferred way?
>
> [1]
> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
> [3] https://github.com/DImuthuUpe/Tesseract-API
> [4]
> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>
> Thanks
> Dimuthu
>
>
> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <[email protected]> wrote:
>
>> I made the necessary changes to the document [1].
>>
>> For the last two days I had a deep look at this [2] JNI wrapper for the
>> Tesseract API. Unfortunately, it has been designed for the Android
>> environment, so I think we need to write our own makefiles to build it
>> into a DLL (Windows) or dylib (Mac). Currently it has Android.mk files
>> [3]. I'm searching for a way to convert it to a makefile that we can run
>> on the console. Please suggest a better approach if you have one.
>>
>> [1]
>> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>> [2]
>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>> [3]
>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>
>>
>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <[email protected]> wrote:
>>
>>> This is a good start. However, there is no need for the Adder component,
>>> "Extracted Text (OCR)" can just feed back into the PDFBox "Text Extractor".
>>>
>>> Maybe show a "PDF" file feeding in to "Text Extractor", to make it clear
>>> where the process starts.
>>>
>>> -- John
>>>
>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <[email protected]> wrote:
>>>
>>> > Sorry for the mistake. I added it to my Dropbox [1].
>>> >
>>> > [1]
>>> > https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>> >
>>> > Thanks
>>> > Dimuthu
>>> >
>>> >
>>> > On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <[email protected]> wrote:
>>> >
>>> >> I should add that the OCR engine should be pluggable, so PDFToText might
>>> >> use an interface, e.g. OCREngine, and there will be a TesseractOCREngine
>>> >> class somewhere which provides the required functionality and lives in a
>>> >> separate jar file.
>>> >>
>>> >> -- John
>>> >>
>>> >>> On 25 Feb 2014, at 20:18, Dimuthu <[email protected]> wrote:
>>> >>>
>>> >>> So do you need to embed those new functionalities into the existing
>>> >>> PDFToText algorithms or package them as a new subsystem (something
>>> >>> like an API)?
>>> >>>
>>> >>> -----Original Message-----
>>> >>> From: "John Hewson" <[email protected]>
>>> >>> Sent: 26/02/2014 07:38
>>> >>> To: "[email protected]" <[email protected]>
>>> >>> Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction
>>> >>>
>>> >>> Yes, exactly. By location data I just mean (x,y) coordinates and page
>>> >>> rotation.
>>> >>>
>>> >>> There is another use case for OCR: some fonts embedded in PDFs have
>>> >>> corrupt encodings, which means the ASCII codes map to the wrong glyphs.
>>> >>> We could OCR the glyphs to repair the encoding.
>>> >>>
>>> >>> -- John
>>> >>>
>>> >>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <[email protected]> wrote:
>>> >>>>
>>> >>>> Hi John,
>>> >>>> Thanks for the explanation.
>>> >>>> Let's say there is a PDF with both text in extractable format and some
>>> >>>> images with text (scanned images). In that case, we first extract the
>>> >>>> extractable content using PDFBox algorithms and the rest is extracted
>>> >>>> using OCR. Finally we pack both results together and give the output as
>>> >>>> PDFToText. Am I correct? What do you mean by "location data"?
>>> >>>>
>>> >>>> Thanks
>>> >>>> Dimuthu
>>> >>>>
>>> >>>>
>>> >>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <[email protected]> wrote:
>>> >>>>>
>>> >>>>>> 1. What is called "glyphs"?
>>> >>>>>
>>> >>>>> http://en.wikipedia.org/wiki/Glyph
>>> >>>>>
>>> >>>>>> 2. What is the main requirement of this project?
>>> >>>>>> As far as I understood, first we need to generate an image of
>>> >>>>>> malformed PDFs from PDFBox and then we need to do processing using
>>> >>>>>> OCR for more accurate results. But the problem is, why shouldn't we
>>> >>>>>> directly do OCR on those PDFs without getting output from PDFBox?
>>> >>>>>> Correct me if I'm wrong.
>>> >>>>>
>>> >>>>> PDFBox can generate images (PDFToImage) and can extract text
>>> >>>>> (PDFToText). The goal of this project is to enhance PDFToText so that
>>> >>>>> it can use OCR to extract text from areas of the document where the
>>> >>>>> text is embedded as an image. Such PDF files are typically generated
>>> >>>>> by scanners or fax machines. There is also another case where OCR is
>>> >>>>> useful: some fonts embedded in PDF files contain the wrong encoding,
>>> >>>>> so when text is extracted with PDFToText the result is nonsense but
>>> >>>>> when drawn with PDFToImage we see the correct letters.
>>> >>>>>
>>> >>>>> Instead of:
>>> >>>>> PDF => Image => OCR => Text
>>> >>>>>
>>> >>>>> We want to do:
>>> >>>>> PDF => (Many images for words + location data => OCR) => Text
>>> >>>>>
>>> >>>>> -- John
>>> >>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <[email protected]> wrote:
>>> >>>>>>
>>> >>>>>>> Ok fixed. This is what I did:
>>> >>>>>>> Right click on the new project -> Debug As -> Debug Configurations
>>> >>>>>>> -> Source -> Add -> Project
>>> >>>>>>> Then I selected the PDFBox project.
>>> >>>>>>>
>>> >>>>>>> Thanks
>>> >>>>>>> Dimuthu
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <[email protected]> wrote:
>>> >>>>>>>
>>> >>>>>>>> I'm using Eclipse. This is what I want. I created a new Java
>>> >>>>>>>> application project (say TestPDFBox) with a main class with the
>>> >>>>>>>> following code.
>>> >>>>>>>>
>>> >>>>>>>> PDDocument document = new PDDocument();
>>> >>>>>>>> PDPage blankPage = new PDPage();
>>> >>>>>>>> document.addPage(blankPage);
>>> >>>>>>>> document.save("BlankPage.pdf");
>>> >>>>>>>> document.close();
>>> >>>>>>>>
>>> >>>>>>>> Then I need to add those jar files generated in the target folder of
>>> >>>>>>>> PDFBox to the build path of my new project (I did build the PDFBox
>>> >>>>>>>> project from source). That is what I did. But let's say I need to
>>> >>>>>>>> check the functionality of the document.save("") method. I don't have
>>> >>>>>>>> a reference to its sources because I directly used the generated
>>> >>>>>>>> jars. As Tilman said, I built PDFBox from sources, but I don't know a
>>> >>>>>>>> proper way to use it in other projects other than adding those jar
>>> >>>>>>>> files to the build path.
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <[email protected]> wrote:
>>> >>>>>>>>
>>> >>>>>>>>> Which IDE are you using? You should be able to run the PDFToText
>>> >>>>>>>>> class (in pdfbox-tools) using your IDE and pass a PDF file path as
>>> >>>>>>>>> the command line argument.
>>> >>>>>>>>>
>>> >>>>>>>>> -- John
>>> >>>>>>>>>
>>> >>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <[email protected]> wrote:
>>> >>>>>>>>>>
>>> >>>>>>>>>> Hi John,
>>> >>>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and managed
>>> >>>>>>>>>> to build it successfully.
>>> >>>>>>>>>> I looked at the classes you mentioned and I got a rough idea about
>>> >>>>>>>>>> how they work. To check them, I added the jars in the target folder
>>> >>>>>>>>>> to my separate Java project. I tried the samples in
>>> >>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to look further into the
>>> >>>>>>>>>> code, especially how those processXXX() methods work in the
>>> >>>>>>>>>> PDFTextStripper class. What I usually do is add some breakpoints and
>>> >>>>>>>>>> check them in the debug windows. But using jars it's not possible.
>>> >>>>>>>>>> What is the way you follow in order to do such a task?
>>> >>>>>>>>>>
>>> >>>>>>>>>> I also installed Tesseract on my machine and managed to do some OCR
>>> >>>>>>>>>> stuff. That's a cool tool which works fine.
>>> >>>>>>>>>> I'm still learning the code. If I get any issue I'll drop you a mail.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Thanks
>>> >>>>>>>>>> Dimuthu
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <[email protected]> wrote:
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Hi Dimuthu
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/. It
>>> >>>>>>>>>>> contains a basic overview of the project and details on how to
>>> >>>>>>>>>>> obtain the source code and build PDFBox for yourself.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details the only
>>> >>>>>>>>>>> thoughts so far regarding it.
>>> >>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue are all
>>> >>>>>>>>>>> under the Apache license, which is a requirement.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Once you have the source code, take a look at the PageDrawer class
>>> >>>>>>>>>>> to see how text and images are rendered. We want someone to
>>> >>>>>>>>>>> interface at a low level (e.g. one glyph, word, or sentence at a
>>> >>>>>>>>>>> time) with an OCR engine. Also look at PDFTextStripper, which is how
>>> >>>>>>>>>>> text is currently extracted; take a look at how we have to go to
>>> >>>>>>>>>>> great lengths to sort text back into reading order and infer the
>>> >>>>>>>>>>> placement of diacritics. PDF is fundamentally a visual format, not a
>>> >>>>>>>>>>> structured format like HTML, which is why extracting text can be so
>>> >>>>>>>>>>> difficult sometimes.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> The full PDF Reference document can be found at:
>>> >>>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
>>> >>>>>>>>>>> questions.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Thanks,
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> -- John
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <[email protected]> wrote:
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>> Hi,
>>> >>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
>>> >>>>>>>>>>>> University of Moratuwa, Sri Lanka. I successfully completed my GSoC
>>> >>>>>>>>>>>> 2013 with the Apache ISIS [1] project. I'm very much interested in
>>> >>>>>>>>>>>> OCR and image processing. So I would like to select this project
>>> >>>>>>>>>>>> idea as my GSoC 2014 project because I feel it is the best suited
>>> >>>>>>>>>>>> project for me.
>>> >>>>>>>>>>>> At university we have also done some research in the OCR area, and
>>> >>>>>>>>>>>> our group wrote a literature review about increasing the efficiency
>>> >>>>>>>>>>>> of OCR systems (attached). Can you please suggest where to start
>>> >>>>>>>>>>>> learning about PDFBox?
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> [1]
>>> >>>>>>>>>>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Thank you
>>> >>>>>>>>>>>> Dimuthu
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> --
>>> >>>>>>>>>>>> Regards
>>> >>>>>>>>>>>> W.Dimuthu Upeksha
>>> >>>>>>>>>>>> Undergraduate
>>> >>>>>>>>>>>> Department of Computer Science And Engineering
>>> >>>>>>>>>>>> University of Moratuwa, Sri Lanka
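
For reference, the pluggable engine John describes in the thread (an OCREngine interface, with a TesseractOCREngine living in a separate jar) might look roughly like the sketch below. This is only an illustration under assumptions: the thread names the interface but not its methods, so the recognize signature, the BufferedImage parameter, and the FixedTextOCREngine and ImageRegionReader classes are all hypothetical and not part of PDFBox. A real TesseractOCREngine would call the native library through JNI instead of returning a fixed string.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical plug-in point: the thread names the interface, not its methods. */
interface OCREngine {
    /** Returns the text recognised in one image region (assumed signature). */
    String recognize(BufferedImage region);
}

/** Stand-in engine for illustration; a TesseractOCREngine would wrap the native API. */
class FixedTextOCREngine implements OCREngine {
    private final String text;

    FixedTextOCREngine(String text) {
        this.text = text;
    }

    @Override
    public String recognize(BufferedImage region) {
        return text; // ignores the pixels; no real OCR happens here
    }
}

/** Caller that depends only on the interface, so engines can be swapped. */
class ImageRegionReader {
    private final OCREngine engine;

    ImageRegionReader(OCREngine engine) {
        this.engine = engine;
    }

    /** Runs the engine on each region and returns the results in the same order. */
    List<String> readAll(List<BufferedImage> regions) {
        List<String> results = new ArrayList<>();
        for (BufferedImage region : regions) {
            results.add(engine.recognize(region));
        }
        return results;
    }
}
```

Because ImageRegionReader only sees the interface, a Tesseract-backed engine could be shipped in its own jar and swapped in without touching the calling code.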
