Hi Dimuthu,

The Tesseract wrapper needs to take its input from a BufferedImage rather than reading a file from disk, so instead of:

    api.setImagePath("test.tif");

What we need is:

    BufferedImage image = ImageIO.read(new File("test.tif"));
    api.setImagePath(image);

This will let us use the BufferedImage generated by PDFRenderer without round-tripping to disk.

-- John

On 11 Mar 2014, at 11:13, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:

> Hi John,
> Thanks for the guidance.
> I did a small analysis of the accuracy and performance of the new
> Tesseract wrapper. I used this [1] image as the input image and got
> the following data [2] after OCR. The first line is the recognised word,
> followed by location details (bounding box) of the word. I think these
> details are pretty much enough for our task. What remains now is
> converting the PDF file into an image as you have mentioned. These days I'm
> working on it.
>
> [1] https://www.dropbox.com/s/11wahtonoz08zmn/image4.TIF
> [2] https://gist.github.com/DImuthuUpe/9491660
>
> Thanks
> Dimuthu
>
> On Mon, Mar 10, 2014 at 2:30 PM, John Hewson <j...@jahewson.com> wrote:
>> Dimuthu,
>>
>>> I finished basic implementation of JNI wrapper for Tesseract. Now it can be
>>> built using Maven. Some useful methods that are needed to do basic OCR were
>>> implemented.
>>
>> Great, it's looking good, nice and clean.
>>
>>> 1. What is the task of processStream method in PDFTextStripper class line
>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>> page.findRotation() );
>>
>> A PDF file is made up of pages, each of which contains a "content stream".
>> This content stream contains a list of drawing commands such as "move to
>> 10,15" or "write the word `foo`"; these are called operators. The
>> processStream function reads the stream for the current page and executes
>> each of the operators. The operators themselves are each implemented in
>> their own class which is a subclass of PDFOperator.
The constructor of >> PDFStreamEngine creates the operator classes using reflection, which is >> rather odd and I'm not sure why this design was chosen. The operators used >> by PDFTextStripper can be found in >> org/apache/pdfbox/resources/PDFTextStripper.properties >> >>> 2. Say I need to extract images and it's metadata from a pdf. What is the >>> better approach to do it? >> >> You could subclass PDFTextStripper and override the startDocument method and >> use it to create a PDFRenderer and store it in a field. Then override the >> processPage method and use the previously created PDFRenderer to render the >> current page to a buffered image and perform OCR on the image. Once you have >> the OCR text + positions, instead of calling processStream you can call >> processTextPosition once for each character + position. >> >> The PDFRenderer class was just added to the trunk, so make sure you do an >> "svn update". Let me know if you need me to change PDFTextStripper to make >> it easier to subclass. >> >> Cheers >> >> -- John >> >> On 9 Mar 2014, at 09:08, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote: >> >>> Hi John, >>> I finished basic implementation of JNI wrapper for Tesseract. Now it can be >>> build using maven. Some useful methods that are needed to do basic OCR were >>> implemented. >>> >>> I went through PDFBox code several times and got couple of issues that are >>> needed to be clarified >>> >>> 1. What is the task of processStream method in PDFTextStripper class line >>> 456 : processStream( page.findResources(), content, page.findCropBox(), >>> page.findRotation() ); >>> >>> 2. Say I need to extract images and it's metadata from a pdf. What is the >>> better approach to do it? >>> >>> Thanks >>> Dimuthu >>> >>> >>> On Fri, Mar 7, 2014 at 9:26 PM, DImuthu Upeksha >>> <dimuthu.upeks...@gmail.com>wrote: >>> >>>> Hi John >>>> I refactored Tesseract JNI code to support maven build. 
To create the JNI >>>> library I added pre-built static libraries of Tesseract and Leptonica to >>>> resources folder[2]. For now it includes librararies supported for mac. But >>>> we can easily add both windows and linux libraries. After "mvn clean >>>> install", the jar is created under target folder. Now all setting up is >>>> done. What remains is implementing those native methods in tessbaseapi.cpp >>>> [3]. Hope to finish it asap. Please let me know if there is any concern >>>> about project structure. >>>> >>>> [1] https://github.com/DImuthuUpe/Tesseract-API.git >>>> [2] >>>> https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources >>>> [3] >>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp >>>> >>>> Thanks >>>> Dimuthu >>>> >>>> >>>> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <j...@jahewson.com> wrote: >>>> >>>>> Dimuthu >>>>> >>>>>> There is a lot of code >>>>>> fractions in current android jni wrapper which use "(jint)somePointer" >>>>>> casting which will create terrible memory leaks in 64 bit environments >>>>>> because ponters are 64 bit. So I believe writing it from the beginning >>>>> is >>>>>> much better. >>>>> >>>>> That's a classic 64-bit pitfall, well spotted. We definitely need to >>>>> support >>>>> 64-bit JVMs. >>>>> >>>>>> we can use >>>>>> the static library of Leptonica (I did and it worked nicely). I think >>>>> it is >>>>>> not a issue to use it's static library because both Tesseract and >>>>> Leptonica >>>>>> is under apache licence. >>>>> >>>>> Sounds good, I found the following in the README: >>>>> >>>>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles >>>>> without Leptonica. >>>>> >>>>> Which makes sense. >>>>> >>>>> -- John >>>>> >>>>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <dimuthu.upeks...@gmail.com> >>>>> wrote: >>>>> >>>>>> Hi John, >>>>>> +1 for you suggestion about converting image <=> byte array at java >>>>> side. 
>>>>>> It reduces lot of complexities. I don't know whether you have noticed or >>>>>> not, jint data type in jni is a 32bit integer type. I noticed it in my >>>>> Mac >>>>>> but don't know about other operating systems. >>>>>> >>>>>> Leptonica is the image processing library for Tesseract [1]. What >>>>> tesseract >>>>>> do is using image processing algorithms in Leptonica to implement its >>>>> OCR >>>>>> algorithms. This [2] is the responsible .cpp file to create Tesseract >>>>> API. >>>>>> You can see it includes allheaders.h header file which is the main >>>>> header >>>>>> file of Leptonoca. So I think it is a must to build Leptonica first and >>>>>> link it when we build Tesseract. This is not a big problem if we can use >>>>>> the static library of Leptonica (I did and it worked nicely). I think >>>>> it is >>>>>> not a issue to use it's static library because both Tesseract and >>>>> Leptonica >>>>>> is under apache licence. >>>>>> >>>>>> I'm working on the maven implementation you have mentioned and will get >>>>>> back to you soon. >>>>>> >>>>>> Thanks >>>>>> Dimuthu >>>>>> >>>>>> >>>>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling >>>>>> [2] >>>>>> >>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp >>>>>> >>>>>> >>>>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <j...@jahewson.com> wrote: >>>>>> >>>>>>> Hi Dimuthu, >>>>>>> >>>>>>> 1,2,3: >>>>>>> >>>>>>> Feel free to write your own Tesseract binding or port the existing >>>>> code as >>>>>>> you see fit. >>>>>>> The JNI binding should be minimal, only the methods you require need >>>>> to be >>>>>>> wrapped. >>>>>>> Also, don't forget that some of the interop can be done in Java, for >>>>>>> example if it is easier >>>>>>> to convert a BufferedImage to a byte array in Java then do it there and >>>>>>> pass the result >>>>>>> to JNI rather than writing lots of JNI C++ to achieve the same result. 
>>>>>>> >>>>>>> Your GitHub repo looks like a good start, I can make comments there as >>>>>>> things progress. >>>>>>> >>>>>>> Is it possible to build Tesseract without leptonica? I was under the >>>>>>> impression that it was >>>>>>> used for image i/o only, but I may be misinformed. >>>>>>> >>>>>>> 4: The native platform library should be built as part of the Maven >>>>> build >>>>>>> for the Tesseract >>>>>>> wrapper which can be a separate project. The output can be a jar file >>>>>>> which contains the >>>>>>> native binaries. It should be possible for the jar to contain prebuilt >>>>>>> binaries for all platforms >>>>>>> but this is something we can worry about later. Right now the goal >>>>> should >>>>>>> be to build a jar >>>>>>> containing just the current platform's native binary and any Java >>>>> wrapper >>>>>>> code. >>>>>>> >>>>>>> -- John >>>>>>> >>>>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <dimuthu.upeks...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi John, >>>>>>>> >>>>>>>> I tried to reuse that android jni wrapper for tesseract. Here is my >>>>>>>> observation >>>>>>>> >>>>>>>> 1. This wrapper heavily depends on android image libraries. >>>>>>>> (android/bitmap.h). Most of the wrapper methods [1] use this library. >>>>>>>> >>>>>>>> 2. But I can understand underlying logic in each function. Basically >>>>> what >>>>>>>> it does is mapping between tesseract api functions [2] with java >>>>> methods. >>>>>>>> In between it does to some image <=> byte array like conversions by >>>>> using >>>>>>>> that bitmap libraries in Android >>>>>>>> >>>>>>>> 3. There are two ways. 1: We can port it's code to make compatible >>>>> with >>>>>>> our >>>>>>>> environments(linux,windows and mac) which is really painful. Also it >>>>> will >>>>>>>> cause memory leaks. 
2: We can use only it's function signatures and >>>>>>>> implement using our codes >>>>>>>> >>>>>>>> I think 2nd solution is better because we need only few operations to >>>>> be >>>>>>>> done using tesseract library. I have created a github repo [3] for >>>>> this. >>>>>>>> It's still not finished. I need to add some make files and build >>>>> files to >>>>>>>> make it run properly. And also I need to implement those wrapper >>>>>>> functions >>>>>>>> [3]. This may take some time. >>>>>>>> >>>>>>>> 4. Because we are calling native libraries we need different builds of >>>>>>>> tesseract and leptonica libraries for each platform (dll for windows, >>>>> so >>>>>>>> for linux, dylib for mac). So we may need to build those libraries at >>>>> the >>>>>>>> time we build pdfbox project. Or we can pre build those libraries and >>>>> add >>>>>>>> them to the project as .dll, .so or .dylib format. What is the >>>>> preferred >>>>>>>> way? >>>>>>>> >>>>>>>> [1] >>>>>>>> >>>>>>> >>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp >>>>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample >>>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API >>>>>>>> [4] >>>>>>>> >>>>>>> >>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp >>>>>>>> >>>>>>>> Thanks >>>>>>>> Dimuthu >>>>>>>> >>>>>>>> >>>>>>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha < >>>>>>> dimuthu.upeks...@gmail.com >>>>>>>>> wrote: >>>>>>>> >>>>>>>>> I updated necessary changes to the document [1] >>>>>>>>> >>>>>>>>> For last two days I had a deep look at this [2] jni wrapper for >>>>>>> tessaract >>>>>>>>> api. >>>>>>>>> Unfortunately this has been designed for Android environment so I >>>>> think >>>>>>> we >>>>>>>>> need to write our own make files to build this in to a dll(windows) >>>>> or >>>>>>>>> dylib(in mac). Currently it has Android.mk files [3]. 
I'm searching >>>>> for >>>>>>> a >>>>>>>>> way to convert it to a make file that we can run on console. Please >>>>>>> suggest >>>>>>>>> if you have a better approach >>>>>>>>> >>>>>>>>> [1] >>>>>>>>> >>>>>>> >>>>> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf >>>>>>>>> [2] >>>>>>>>> >>>>>>> >>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/ >>>>>>>>> [3] >>>>>>>>> >>>>>>> >>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk >>>>>>>>> >>>>>>>>> >>>>>>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <j...@jahewson.com> >>>>> wrote: >>>>>>>>> >>>>>>>>>> This is a good start. However, there is no need for the Adder >>>>>>> component, >>>>>>>>>> "Extracted Text (OCR) can just feed back into the PDFBox "Text >>>>>>> Extractor". >>>>>>>>>> >>>>>>>>>> Maybe show a "PDF" file feeding in to "Text Extractor, to make it >>>>> clear >>>>>>>>>> where the process starts. >>>>>>>>>> >>>>>>>>>> -- John >>>>>>>>>> >>>>>>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha < >>>>> dimuthu.upeks...@gmail.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Sorry for the mistake. I added it to my Dropbox [1]. >>>>>>>>>>> >>>>>>>>>>> [1] >>>>>>>>>>> >>>>>>>>>> >>>>>>> >>>>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf >>>>>>>>>>> >>>>>>>>>>> Thanks >>>>>>>>>>> Dimuthu >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <j...@jahewson.com> >>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> I should add that the OCR engine should be pluggable so PDFToText >>>>>>> might >>>>>>>>>>>> use an interface, e.g. OCREngine and there will be a >>>>>>> TesseractOCREngine >>>>>>>>>>>> class somewhere which provides the required functionality and >>>>> lives >>>>>>> in >>>>>>>>>> a >>>>>>>>>>>> separate jar file. 
>>>>>>>>>>>> >>>>>>>>>>>> -- John >>>>>>>>>>>> >>>>>>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <dimuthu.upeks...@gmail.com> >>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> So do you need to embed those new functionalities into existing >>>>>>>>>>>> PDFtoText algorithms or package them as a new sub system(something >>>>>>>>>> like an >>>>>>>>>>>> API)? >>>>>>>>>>>>> >>>>>>>>>>>>> -----Original Message----- >>>>>>>>>>>>> From: "John Hewson" <j...@jahewson.com> >>>>>>>>>>>>> Sent: 26/02/2014 07:38 >>>>>>>>>>>>> To: "dev@pdfbox.apache.org" <dev@pdfbox.apache.org> >>>>>>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project - >>>>>>>>>>>> Introduction >>>>>>>>>>>>> >>>>>>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and >>>>>>> page >>>>>>>>>>>> rotation. >>>>>>>>>>>>> >>>>>>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs >>>>> have >>>>>>>>>>>> corrupt encodings, which means the ACSII codes map to the wrong >>>>>>>>>> glyphs. We >>>>>>>>>>>> could OCR the glyphs to repair the encoding. >>>>>>>>>>>>> >>>>>>>>>>>>> -- John >>>>>>>>>>>>> >>>>>>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha < >>>>>>>>>> dimuthu.upeks...@gmail.com> >>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hi John, >>>>>>>>>>>>>> Thanks for the explanation. >>>>>>>>>>>>>> Let's say there is a pdf with both text in extractable format >>>>> and >>>>>>>>>> some >>>>>>>>>>>>>> images with text(Scanned images). In that case first we extract >>>>>>> those >>>>>>>>>>>>>> extractable content using PDFBox algorithms and rest is >>>>> extracted >>>>>>>>>> using >>>>>>>>>>>>>> OCR. Finally we pack both results together and give output as >>>>>>>>>>>> PDFToText. Am >>>>>>>>>>>>>> I correct? What do you mean by "location data"? 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>> Dimuthu >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson < >>>>> j...@jahewson.com> >>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 1. What is called "glyphs" ? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 2. What is the main requirement of this project? >>>>>>>>>>>>>>>> As far as I understood, first we need to generate an image of >>>>>>>>>>>>>>>> malformed pdfs from >>>>>>>>>>>>>>>> PDFBox and then we need to do processing using OCR for further >>>>>>>>>>>> accurate >>>>>>>>>>>>>>>> results. But the problem is, why shouldn't we directly do >>>>> OCR on >>>>>>>>>>>> those >>>>>>>>>>>>>>>> PDFs without getting output from PDFBox? Correct me if I'm >>>>> wrong. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text >>>>>>>>>>>> (PDFToText). >>>>>>>>>>>>>>> The goal of >>>>>>>>>>>>>>> this project is to enhance PDFToText so that it can use OCR to >>>>>>>>>> extract >>>>>>>>>>>>>>> text from areas of the >>>>>>>>>>>>>>> document where the text is embedded as an image. Such PDF files >>>>>>> are >>>>>>>>>>>>>>> typically generated by >>>>>>>>>>>>>>> scanners or fax machines. There is also another case where OCR >>>>> is >>>>>>>>>>>> useful: >>>>>>>>>>>>>>> some fonts embedded >>>>>>>>>>>>>>> in PDF files contain the wrong encoding, so when text is >>>>> extracted >>>>>>>>>> with >>>>>>>>>>>>>>> PDFToText the result is >>>>>>>>>>>>>>> nonsense but when drawn with PDFToImage we see the correct >>>>>>> letters. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Instead of: >>>>>>>>>>>>>>> PDF => Image => OCR => Text >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We want to do: >>>>>>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- John >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha < >>>>>>>>>>>>>>> dimuthu.upeks...@gmail.com >>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Ok fixed. This is what I did >>>>>>>>>>>>>>>>> Right click on the new project ->Debug As-> Debug >>>>> Configurations >>>>>>>>>>>>>>> ->Source >>>>>>>>>>>>>>>>> ->Add -> Project >>>>>>>>>>>>>>>>> Then I selected PDFBox project. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>> Dimuthu >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha < >>>>>>>>>>>>>>>>> dimuthu.upeks...@gmail.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I'm using eclipse. This is what I want. I created a new Java >>>>>>>>>>>>>>> application >>>>>>>>>>>>>>>>>> project (say TestPDFBox) with a main class with following >>>>> code. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> PDDocument document = new PDDocument();PDPage blankPage = >>>>> new >>>>>>>>>>>>>>> PDPage();document.addPage( blankPage >>>>>>>>>>>>>>> );document.save("BlankPage.pdf");document.close(); >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Then I need to add those jar files generated in target >>>>> folder >>>>>>> of >>>>>>>>>>>> PDFBox >>>>>>>>>>>>>>>>>> to build path of my new project (I did build the PDFBox >>>>> project >>>>>>>>>> from >>>>>>>>>>>>>>>>>> source). That is what I did. But let's say I need to check >>>>> the >>>>>>>>>>>>>>>>>> functionality of document.save("") method. But I don't have >>>>> a >>>>>>>>>>>>>>> reference to >>>>>>>>>>>>>>>>>> it's sources because I directly used generated jars. 
As >>>>> Tilman >>>>>>>>>> said >>>>>>>>>>>> I >>>>>>>>>>>>>>> built >>>>>>>>>>>>>>>>>> PDFBox from sources but I don't know a proper way to use it >>>>>>> other >>>>>>>>>>>>>>> projects >>>>>>>>>>>>>>>>>> other than adding those jar files to build path. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson < >>>>>>> j...@jahewson.com> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the >>>>>>> PDFToText >>>>>>>>>>>> class >>>>>>>>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path >>>>> as >>>>>>> the >>>>>>>>>>>>>>> command >>>>>>>>>>>>>>>>>>> line argument. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> -- John >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha < >>>>>>>>>>>>>>> dimuthu.upeks...@gmail.com> >>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Hi John, >>>>>>>>>>>>>>>>>>>> Thanks for the reply. Yes I checked out PDFBox code and >>>>>>>>>> managed to >>>>>>>>>>>>>>>>>>> build >>>>>>>>>>>>>>>>>>>> code successfully. I looked at the classes you mentioned >>>>> and >>>>>>> I >>>>>>>>>>>> got a >>>>>>>>>>>>>>>>>>> rough >>>>>>>>>>>>>>>>>>>> idea about how they are working. To check them I used the >>>>>>> jars >>>>>>>>>> in >>>>>>>>>>>>>>>>>>> target >>>>>>>>>>>>>>>>>>>> folder to my separate java project. I tried samples in >>>>>>>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to further >>>>> look >>>>>>>>>> into >>>>>>>>>>>> code >>>>>>>>>>>>>>>>>>>> specially how those processXXX() methods work in >>>>>>>>>> PDFTextStripper >>>>>>>>>>>>>>> class. >>>>>>>>>>>>>>>>>>>> What I usually do is adding some berakpoints and checking >>>>>>> them >>>>>>>>>> in >>>>>>>>>>>>>>> debug >>>>>>>>>>>>>>>>>>>> windows. But using jars it's not possible. What is the way >>>>>>> you >>>>>>>>>>>> follow >>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>>> order to do such task? 
>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> As well I installed tesseract in to my machine and >>>>> managed to >>>>>>>>>> do >>>>>>>>>>>> some >>>>>>>>>>>>>>>>>>> OCR >>>>>>>>>>>>>>>>>>>> stuff also. That's a cool tool which works fine. >>>>>>>>>>>>>>>>>>>> I'm still learning the code. If I get any issue I'll drop >>>>>>> you a >>>>>>>>>>>> mail. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>>>>>>>> Dimuthu >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson < >>>>>>>>>> j...@jahewson.com >>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Hi Dimuthu >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> The PDFBox website can be found at >>>>>>> http://pdfbox.apache.org/it >>>>>>>>>>>>>>>>>>> contains >>>>>>>>>>>>>>>>>>>>> a basic overview of the project >>>>>>>>>>>>>>>>>>>>> and details on how to obtain the source code and build >>>>>>> PDFBox >>>>>>>>>> for >>>>>>>>>>>>>>>>>>> yourself. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 >>>>> details >>>>>>>>>> the >>>>>>>>>>>> only >>>>>>>>>>>>>>>>>>>>> thoughts so far regarding it. >>>>>>>>>>>>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue >>>>> are >>>>>>>>>> all >>>>>>>>>>>>>>> under >>>>>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>>>>>>> Apache license, which is a >>>>>>>>>>>>>>>>>>>>> requirement. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Once you have the source code, take a look at the >>>>> PageDrawer >>>>>>>>>>>> class >>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>>> see >>>>>>>>>>>>>>>>>>>>> how text and images are >>>>>>>>>>>>>>>>>>>>> rendered. We want someone to interface at a low-level >>>>> (e.g. >>>>>>>>>> one >>>>>>>>>>>>>>> glyph, >>>>>>>>>>>>>>>>>>>>> word, or sentence at a time) with >>>>>>>>>>>>>>>>>>>>> an OCR engine. 
Also look at PDFTextStripper which is how >>>>>>> text >>>>>>>>>> is >>>>>>>>>>>>>>>>>>> currently >>>>>>>>>>>>>>>>>>>>> extracted, take a look at how >>>>>>>>>>>>>>>>>>>>> we have to go to great length to sort text back into >>>>> reading >>>>>>>>>>>> order >>>>>>>>>>>>>>> and >>>>>>>>>>>>>>>>>>>>> infer the placement of diacritics - PDF >>>>>>>>>>>>>>>>>>>>> is fundamentally a visual format, not a structured format >>>>>>> like >>>>>>>>>>>> HTML >>>>>>>>>>>>>>> - >>>>>>>>>>>>>>>>>>>>> which is why extracting text can be so >>>>>>>>>>>>>>>>>>>>> difficult sometimes. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> The full PDF Reference document can be found at: >>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>> >>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask >>>>> any >>>>>>>>>>>>>>> questions. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> -- John >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha < >>>>>>>>>>>>>>> dimuthu.upeks...@gmail.com >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering >>>>> Undergraduate >>>>>>> at >>>>>>>>>>>>>>>>>>> University >>>>>>>>>>>>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC >>>>> 2013 >>>>>>>>>> with >>>>>>>>>>>>>>>>>>> Apache >>>>>>>>>>>>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and >>>>> image >>>>>>>>>>>>>>> processing >>>>>>>>>>>>>>>>>>>>> stuff. So I would like to select this project idea as my >>>>>>> GSoC >>>>>>>>>>>> 2014 >>>>>>>>>>>>>>>>>>> project >>>>>>>>>>>>>>>>>>>>> because I feel like it is the best suited project for >>>>> me. 
In university also we have done some research in the OCR area, and our group wrote a literature review about increasing the efficiency of OCR systems (attached). Can you please suggest where I should start learning about PDFBox?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> [1] http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
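
For reference, the BufferedImage hand-off described at the top of this message, combined with the earlier suggestion in the thread to convert the image to a byte array on the Java side before crossing into JNI, might look roughly like the sketch below. The class name ImageBytes and the method toGrayBytes are made up for illustration and are not part of PDFBox or the Tesseract wrapper. The choice of 8-bit grayscale (one byte per pixel) is also an assumption about what the native side would want.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical helper: not part of PDFBox or the Tesseract wrapper.
final class ImageBytes {
    private ImageBytes() {}

    /**
     * Converts any BufferedImage to 8-bit grayscale and returns the pixels
     * as a flat array, one byte per pixel, rows from top to bottom.
     * The result has exactly width * height entries.
     */
    static byte[] toGrayBytes(BufferedImage source) {
        int width = source.getWidth();
        int height = source.getHeight();

        // Redraw into a grayscale image so the pixel layout is known,
        // whatever type the renderer produced.
        BufferedImage gray =
                new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }

        // For a byte-backed, single-band raster this yields a byte[] of
        // length width * height, independent of any internal row padding.
        return (byte[]) gray.getRaster()
                .getDataElements(0, 0, width, height, null);
    }
}
```

A wrapper method that accepts a BufferedImage could call something like this and pass the resulting array, together with the width and height, to a native method, so that the C++ side never has to deal with Java image classes.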