Dimuthu,

> There are a lot of code fragments in the current Android JNI wrapper which use
> "(jint)somePointer" casting, which will create terrible memory leaks in 64-bit
> environments because pointers are 64 bits wide. So I believe writing it from
> the beginning is much better.
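To make the pitfall concrete, here is a minimal Java sketch (not code from the Android wrapper) of what a "(jint)somePointer" style cast does to a 64-bit address: the upper 32 bits are discarded, so the value can no longer be turned back into the original pointer. The address and the names HandleWidthDemo, toJint and fromJint are made up for illustration. The usual remedy on the JNI side is to pass native handles as jlong, which maps to a Java long.

```java
class HandleWidthDemo {
    /** Simulates the lossy "(jint)somePointer" cast: only the low 32 bits survive. */
    static int toJint(long nativePointer) {
        return (int) nativePointer;
    }

    /** Widening the int back to 64 bits cannot recover the discarded high bits. */
    static long fromJint(int handle) {
        return handle & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        long pointer = 0x00007F3A12345678L; // hypothetical 64-bit address
        long restored = fromJint(toJint(pointer));

        System.out.println(Long.toHexString(pointer));  // prints 7f3a12345678
        System.out.println(Long.toHexString(restored)); // prints 12345678
        System.out.println(pointer == restored);        // prints false
    }
}
```

Keeping the handle in a long field on the Java side avoids the round trip through 32 bits entirely.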
That's a classic 64-bit pitfall, well spotted. We definitely need to support 64-bit JVMs.

> we can use
> the static library of Leptonica (I did and it worked nicely). I think it is
> not an issue to use its static library because both Tesseract and Leptonica
> are under the Apache licence.

Sounds good, I found the following in the README:

  Leptonica is required. (www.leptonica.com). Tesseract no longer compiles without Leptonica.

Which makes sense.

-- John

On 5 Mar 2014, at 09:45, DImuthu Upeksha <[email protected]> wrote:

> Hi John,
>
> +1 for your suggestion about converting image <=> byte array on the Java side.
> It removes a lot of complexity. I don't know whether you have noticed or
> not, but the jint data type in JNI is a 32-bit integer type. I noticed it on my
> Mac but don't know about other operating systems.
>
> Leptonica is the image processing library for Tesseract [1]. Tesseract
> uses the image processing algorithms in Leptonica to implement its OCR
> algorithms. This [2] is the .cpp file responsible for creating the Tesseract API.
> You can see that it includes allheaders.h, which is the main header
> file of Leptonica. So I think we must build Leptonica first and
> link it when we build Tesseract. This is not a big problem if we can use
> the static library of Leptonica (I did and it worked nicely). I think it is
> not an issue to use its static library because both Tesseract and Leptonica
> are under the Apache licence.
>
> I'm working on the Maven implementation you mentioned and will get
> back to you soon.
>
> Thanks
> Dimuthu
>
> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
> [2] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>
> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <[email protected]> wrote:
>
>> Hi Dimuthu,
>>
>> 1,2,3:
>>
>> Feel free to write your own Tesseract binding or port the existing code as
>> you see fit.
>> The JNI binding should be minimal, only the methods you require need to be
>> wrapped. Also, don't forget that some of the interop can be done in Java, for
>> example if it is easier to convert a BufferedImage to a byte array in Java then
>> do it there and pass the result to JNI rather than writing lots of JNI C++ to
>> achieve the same result.
>>
>> Your GitHub repo looks like a good start, I can make comments there as
>> things progress.
>>
>> Is it possible to build Tesseract without Leptonica? I was under the
>> impression that it was used for image I/O only, but I may be misinformed.
>>
>> 4: The native platform library should be built as part of the Maven build
>> for the Tesseract wrapper, which can be a separate project. The output can be
>> a jar file which contains the native binaries. It should be possible for the
>> jar to contain prebuilt binaries for all platforms but this is something we
>> can worry about later. Right now the goal should be to build a jar containing
>> just the current platform's native binary and any Java wrapper code.
>>
>> -- John
>>
>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <[email protected]> wrote:
>>
>>> Hi John,
>>>
>>> I tried to reuse that Android JNI wrapper for Tesseract. Here are my
>>> observations:
>>>
>>> 1. This wrapper depends heavily on Android image libraries
>>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>>>
>>> 2. But I can understand the underlying logic in each function. Basically what
>>> it does is map Tesseract API functions [2] to Java methods.
>>> In between it does some image <=> byte array conversions using
>>> those Android bitmap libraries.
>>>
>>> 3. There are two ways. 1: We can port its code to make it compatible with our
>>> environments (Linux, Windows and Mac), which is really painful. Also it will
>>> cause memory leaks.
>>> 2: We can use only its function signatures and implement them with our
>>> own code.
>>>
>>> I think the 2nd solution is better because we need only a few operations to be
>>> done using the Tesseract library. I have created a GitHub repo [3] for this.
>>> It's still not finished. I need to add some makefiles and build files to
>>> make it run properly. I also need to implement those wrapper functions
>>> [4]. This may take some time.
>>>
>>> 4. Because we are calling native libraries we need different builds of the
>>> Tesseract and Leptonica libraries for each platform (dll for Windows, so
>>> for Linux, dylib for Mac). So we may need to build those libraries at the
>>> time we build the PDFBox project. Or we can prebuild those libraries and add
>>> them to the project in .dll, .so or .dylib format. What is the preferred
>>> way?
>>>
>>> [1] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>> [4] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>>
>>> Thanks
>>> Dimuthu
>>>
>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <[email protected]> wrote:
>>>
>>>> I made the necessary changes to the document [1].
>>>>
>>>> For the last two days I had a deep look at this [2] JNI wrapper for the
>>>> Tesseract API. Unfortunately it has been designed for the Android
>>>> environment, so I think we need to write our own makefiles to build it into
>>>> a dll (Windows) or dylib (Mac). Currently it has Android.mk files [3]. I'm
>>>> searching for a way to convert it to a makefile that we can run on the console.
>>>> Please suggest a better approach if you have one.
>>>>
>>>> [1] https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>>>> [2] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>>>> [3] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>>>
>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <[email protected]> wrote:
>>>>
>>>>> This is a good start. However, there is no need for the Adder component,
>>>>> "Extracted Text (OCR)" can just feed back into the PDFBox "Text Extractor".
>>>>>
>>>>> Maybe show a "PDF" file feeding in to "Text Extractor", to make it clear
>>>>> where the process starts.
>>>>>
>>>>> -- John
>>>>>
>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <[email protected]> wrote:
>>>>>
>>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>>>>
>>>>>> [1] https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>>>>
>>>>>> Thanks
>>>>>> Dimuthu
>>>>>>
>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <[email protected]> wrote:
>>>>>>
>>>>>>> I should add that the OCR engine should be pluggable, so PDFToText might
>>>>>>> use an interface, e.g. OCREngine, and there will be a TesseractOCREngine
>>>>>>> class somewhere which provides the required functionality and lives in a
>>>>>>> separate jar file.
>>>>>>>
>>>>>>> -- John
>>>>>>>
>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <[email protected]> wrote:
>>>>>>>>
>>>>>>>> So do you need to embed those new functionalities into the existing
>>>>>>>> PDFToText algorithms or package them as a new subsystem (something like an
>>>>>>>> API)?
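The pluggable engine idea above can be sketched roughly as follows. Only the names OCREngine and TesseractOCREngine appear in the thread; the recognize method, the FixedTextOCREngine stand-in and the TextExtractorSketch caller are assumptions made for illustration, not PDFBox API. A real TesseractOCREngine would sit where the stand-in does and delegate to native code.

```java
import java.awt.image.BufferedImage;

class OcrPluginDemo {
    public static void main(String[] args) {
        BufferedImage region = new BufferedImage(10, 10, BufferedImage.TYPE_BYTE_GRAY);
        // The caller is handed an engine; it never names a concrete implementation.
        TextExtractorSketch extractor =
                new TextExtractorSketch(new FixedTextOCREngine("scanned words"));
        System.out.println(extractor.extract("Typed text. ", region));
    }
}

/** Hypothetical plug-in point; the method signature is an assumption. */
interface OCREngine {
    /** Returns the text recognised in the given image region. */
    String recognize(BufferedImage image);
}

/** Stand-in engine that returns fixed text instead of calling a native OCR library. */
class FixedTextOCREngine implements OCREngine {
    private final String text;

    FixedTextOCREngine(String text) {
        this.text = text;
    }

    @Override
    public String recognize(BufferedImage image) {
        return text;
    }
}

/** Hypothetical caller that depends only on the OCREngine interface. */
class TextExtractorSketch {
    private final OCREngine engine;

    TextExtractorSketch(OCREngine engine) {
        this.engine = engine;
    }

    /** Appends OCR output for an image region to text that was extracted normally. */
    String extract(String extractableText, BufferedImage scannedRegion) {
        return extractableText + engine.recognize(scannedRegion);
    }
}
```

Because the caller sees only the interface, the Tesseract-backed class and its native binaries can live in a separate jar, as suggested in the thread.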
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: "John Hewson" <[email protected]>
>>>>>>>> Sent: 26/02/2014 07:38
>>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction
>>>>>>>>
>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and page
>>>>>>>> rotation.
>>>>>>>>
>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs have
>>>>>>>> corrupt encodings, which means the ASCII codes map to the wrong glyphs. We
>>>>>>>> could OCR the glyphs to repair the encoding.
>>>>>>>>
>>>>>>>> -- John
>>>>>>>>
>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi John,
>>>>>>>>> Thanks for the explanation.
>>>>>>>>> Let's say there is a PDF with both text in an extractable format and
>>>>>>>>> some images containing text (scanned images). In that case we first extract
>>>>>>>>> the extractable content using PDFBox algorithms, and the rest is extracted
>>>>>>>>> using OCR. Finally we pack both results together and give the output as
>>>>>>>>> PDFToText. Am I correct? What do you mean by "location data"?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Dimuthu
>>>>>>>>>
>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> 1. What is called "glyphs"?
>>>>>>>>>>
>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>>>>
>>>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>>>> As far as I understand, first we need to generate an image of
>>>>>>>>>>> malformed PDFs from PDFBox and then we need to process it using OCR for
>>>>>>>>>>> more accurate results. But the problem is, why shouldn't we directly do
>>>>>>>>>>> OCR on those PDFs without getting output from PDFBox? Correct me if I'm
>>>>>>>>>>> wrong.
>>>>>>>>>>
>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text (PDFToText).
>>>>>>>>>> The goal of this project is to enhance PDFToText so that it can use OCR to
>>>>>>>>>> extract text from areas of the document where the text is embedded as an
>>>>>>>>>> image. Such PDF files are typically generated by scanners or fax machines.
>>>>>>>>>> There is also another case where OCR is useful: some fonts embedded in PDF
>>>>>>>>>> files contain the wrong encoding, so when text is extracted with PDFToText
>>>>>>>>>> the result is nonsense but when drawn with PDFToImage we see the correct
>>>>>>>>>> letters.
>>>>>>>>>>
>>>>>>>>>> Instead of:
>>>>>>>>>> PDF => Image => OCR => Text
>>>>>>>>>>
>>>>>>>>>> We want to do:
>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>>>>
>>>>>>>>>> -- John
>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Ok fixed. This is what I did:
>>>>>>>>>>>> Right click on the new project -> Debug As -> Debug Configurations -> Source
>>>>>>>>>>>> -> Add -> Project
>>>>>>>>>>>> Then I selected the PDFBox project.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I'm using Eclipse. This is what I want. I created a new Java application
>>>>>>>>>>>>> project (say TestPDFBox) with a main class containing the following code.
>>>>>>>>>>>>>
>>>>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>>>>> document.addPage( blankPage );
>>>>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>>>>> document.close();
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then I need to add the jar files generated in the target folder of PDFBox
>>>>>>>>>>>>> to the build path of my new project (I did build the PDFBox project from
>>>>>>>>>>>>> source). That is what I did. But let's say I need to check the
>>>>>>>>>>>>> functionality of the document.save("") method. I don't have a reference to
>>>>>>>>>>>>> its sources because I directly used the generated jars. As Tilman said, I
>>>>>>>>>>>>> built PDFBox from sources, but I don't know a proper way to use it in other
>>>>>>>>>>>>> projects other than adding those jar files to the build path.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the PDFToText class
>>>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as the command
>>>>>>>>>>>>>> line argument.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and managed to
>>>>>>>>>>>>>>> build it successfully. I looked at the classes you mentioned and I got a
>>>>>>>>>>>>>>> rough idea about how they work. To check them I added the jars in the
>>>>>>>>>>>>>>> target folder to my separate Java project. I tried the samples in
>>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/.
>>>>>>>>>>>>>>> I need to look further into the code, especially how those processXXX()
>>>>>>>>>>>>>>> methods work in the PDFTextStripper class. What I usually do is add some
>>>>>>>>>>>>>>> breakpoints and check them in the debug windows. But with jars that's
>>>>>>>>>>>>>>> not possible. What approach do you follow for such a task?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I also installed Tesseract on my machine and managed to do some OCR
>>>>>>>>>>>>>>> as well. It's a cool tool which works fine.
>>>>>>>>>>>>>>> I'm still learning the code. If I run into any issue I'll drop you a mail.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/. It contains
>>>>>>>>>>>>>>>> a basic overview of the project and details on how to obtain the source
>>>>>>>>>>>>>>>> code and build PDFBox for yourself.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details the only
>>>>>>>>>>>>>>>> thoughts so far regarding it. Note that the OCR libraries mentioned in the
>>>>>>>>>>>>>>>> JIRA issue are all under the Apache license, which is a requirement.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Once you have the source code, take a look at the PageDrawer class to see
>>>>>>>>>>>>>>>> how text and images are rendered. We want someone to interface at a
>>>>>>>>>>>>>>>> low level (e.g. one glyph, word, or sentence at a time) with an OCR engine.
>>>>>>>>>>>>>>>> Also look at PDFTextStripper, which is how text is currently extracted.
>>>>>>>>>>>>>>>> Take a look at how we have to go to great lengths to sort text back into
>>>>>>>>>>>>>>>> reading order and infer the placement of diacritics - PDF is fundamentally
>>>>>>>>>>>>>>>> a visual format, not a structured format like HTML - which is why
>>>>>>>>>>>>>>>> extracting text can be so difficult sometimes.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any questions.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka. I successfully completed GSoC 2013 with
>>>>>>>>>>>>>>>>> the Apache ISIS [1] project. I'm very much interested in OCR and image
>>>>>>>>>>>>>>>>> processing. So I would like to select this project idea as my GSoC 2014
>>>>>>>>>>>>>>>>> project because I feel it is the best suited project for me. At university
>>>>>>>>>>>>>>>>> we have also done some research in the OCR area and our group wrote a
>>>>>>>>>>>>>>>>> literature review about increasing the efficiency of OCR systems
>>>>>>>>>>>>>>>>> (attached). Can you please suggest where to start learning about PDFBox?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1] http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
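As a footnote to the suggestion earlier in the thread that a BufferedImage could be converted to a byte array on the Java side before crossing into JNI, here is one possible sketch using only standard java.awt classes. The class and method names (ImageBytesDemo, toGrayBytes) are hypothetical, and the choice of 8-bit greyscale is an assumption; the thread does not say which pixel layout the native side would expect.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;

class ImageBytesDemo {
    /** Converts any BufferedImage to 8-bit greyscale, one byte per pixel, row by row. */
    static byte[] toGrayBytes(BufferedImage source) {
        BufferedImage gray = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            // Drawing into the grey image performs the colour conversion.
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        // TYPE_BYTE_GRAY images are backed by a single byte array.
        byte[] data = ((DataBufferByte) gray.getRaster().getDataBuffer()).getData();
        return data.clone();
    }

    public static void main(String[] args) {
        BufferedImage page = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        byte[] pixels = toGrayBytes(page);
        // A hypothetical native method could now receive (pixels, width, height).
        System.out.println(pixels.length); // prints 12
    }
}
```

With the pixels already in a plain byte[], the JNI layer only has to copy or pin one array, which keeps the C++ side small.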
