Thanks, I saw your new refactoring too, it’s good. Now the following methods are no longer needed:
public void setImagePath(String path)
public void setImage(byte[] imagedata, int width, int height, int bpp, int bpl)

Cheers

-- John

On 11 Mar 2014, at 22:58, DImuthu Upeksha <[email protected]> wrote:

> Hi John,
> Yes. I implemented a new method that accepts the image as a byte stream. We can't send BufferedImage objects directly to the native side, so I convert the buffered image into a byte array and pass that to the native side, where it is converted back into a compatible format. With that request we need to pass some metadata about the byte stream: image width, height, bytes per pixel and bytes per row. I checked it with this [2] test case and it works fine.
>
> [1] https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/java/com/apache/pdfbox/ocr/tesseract/TessBaseAPI.java#L74
> [2] https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/test/java/com/apache/pdfbox/ocr/tesseract/TessByteSteamTest.java
>
> Thanks
> Dimuthu
>
> On Wed, Mar 12, 2014 at 12:40 AM, John Hewson <[email protected]> wrote:
>> Hi Dimuthu
>>
>> The Tesseract wrapper needs to take its input from a BufferedImage rather than reading a file from disk, so instead of:
>>
>> api.setImagePath("test.tif");
>>
>> What we need is:
>>
>> BufferedImage image = ImageIO.read(new File("test.tif"));
>> api.setImagePath(image);
>>
>> Because this will let us use the BufferedImage generated by PDFRenderer without round-tripping to the disk.
>>
>> -- John
>>
>> On 11 Mar 2014, at 11:13, DImuthu Upeksha <[email protected]> wrote:
>>
>>> Hi John,
>>> Thanks for the guidance.
>>> I did a small analysis of the accuracy and performance of the new Tesseract wrapper. I used this [1] image as the input and got the following data [2] after OCR. The first line is the recognised word, followed by the location details (bounding box) of the word. I think these details are pretty much enough for our task.
>>> What remains now is converting the PDF file into an image, as you mentioned. I'm working on that these days.
>>>
>>> [1] https://www.dropbox.com/s/11wahtonoz08zmn/image4.TIF
>>> [2] https://gist.github.com/DImuthuUpe/9491660
>>>
>>> Thanks
>>> Dimuthu
>>>
>>> On Mon, Mar 10, 2014 at 2:30 PM, John Hewson <[email protected]> wrote:
>>>> Dimuthu,
>>>>
>>>>> I finished the basic implementation of the JNI wrapper for Tesseract. It can now be built using Maven. Some useful methods that are needed to do basic OCR were implemented.
>>>>
>>>> Great, it's looking good, nice and clean.
>>>>
>>>>> 1. What is the task of the processStream method in the PDFTextStripper class, line 456: processStream( page.findResources(), content, page.findCropBox(), page.findRotation() );
>>>>
>>>> A PDF file is made up of pages, each of which contains a "content stream". This content stream contains a list of drawing commands such as "move to 10,15" or "write the word `foo`"; these are called operators. The processStream function reads the stream for the current page and executes each of the operators. The operators themselves are each implemented in their own class, which is a subclass of PDFOperator. The constructor of PDFStreamEngine creates the operator classes using reflection, which is rather odd and I'm not sure why this design was chosen. The operators used by PDFTextStripper can be found in org/apache/pdfbox/resources/PDFTextStripper.properties
>>>>
>>>>> 2. Say I need to extract images and their metadata from a PDF. What is the better approach to do it?
>>>>
>>>> You could subclass PDFTextStripper and override the startDocument method and use it to create a PDFRenderer and store it in a field. Then override the processPage method and use the previously created PDFRenderer to render the current page to a buffered image and perform OCR on the image. Once you have the OCR text + positions, instead of calling processStream you can call processTextPosition once for each character + position.
>>>>
>>>> The PDFRenderer class was just added to the trunk, so make sure you do an "svn update". Let me know if you need me to change PDFTextStripper to make it easier to subclass.
>>>>
>>>> Cheers
>>>>
>>>> -- John
>>>>
>>>> On 9 Mar 2014, at 09:08, DImuthu Upeksha <[email protected]> wrote:
>>>>
>>>>> Hi John,
>>>>> I finished the basic implementation of the JNI wrapper for Tesseract. It can now be built using Maven. Some useful methods that are needed to do basic OCR were implemented.
>>>>>
>>>>> I went through the PDFBox code several times and have a couple of points that need to be clarified:
>>>>>
>>>>> 1. What is the task of the processStream method in the PDFTextStripper class, line 456: processStream( page.findResources(), content, page.findCropBox(), page.findRotation() );
>>>>>
>>>>> 2. Say I need to extract images and their metadata from a PDF. What is the better approach to do it?
>>>>>
>>>>> Thanks
>>>>> Dimuthu
>>>>>
>>>>> On Fri, Mar 7, 2014 at 9:26 PM, DImuthu Upeksha <[email protected]> wrote:
>>>>>
>>>>>> Hi John
>>>>>> I refactored the Tesseract JNI code to support a Maven build. To create the JNI library I added pre-built static libraries of Tesseract and Leptonica to the resources folder [2]. For now it includes only the libraries for Mac, but we can easily add both Windows and Linux libraries. After "mvn clean install", the jar is created under the target folder. Now all the setting up is done. What remains is implementing the native methods in tessbaseapi.cpp [3]. I hope to finish it asap. Please let me know if there is any concern about the project structure.
>>>>>>
>>>>>> [1] https://github.com/DImuthuUpe/Tesseract-API.git
>>>>>> [2] https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
>>>>>>
>>>>>> Thanks
>>>>>> Dimuthu
>>>>>>
>>>>>> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <[email protected]> wrote:
>>>>>>
>>>>>>> Dimuthu
>>>>>>>
>>>>>>>> There are a lot of code fragments in the current Android JNI wrapper which use "(jint)somePointer" casting, which will create terrible memory leaks in 64 bit environments because pointers are 64 bit. So I believe writing it from the beginning is much better.
>>>>>>>
>>>>>>> That's a classic 64-bit pitfall, well spotted. We definitely need to support 64-bit JVMs.
>>>>>>>
>>>>>>>> we can use the static library of Leptonica (I did and it worked nicely). I think it is not an issue to use its static library because both Tesseract and Leptonica are under the Apache licence.
>>>>>>>
>>>>>>> Sounds good, I found the following in the README:
>>>>>>>
>>>>>>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles without Leptonica.
>>>>>>>
>>>>>>> Which makes sense.
>>>>>>>
>>>>>>> -- John
>>>>>>>
>>>>>>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi John,
>>>>>>>> +1 for your suggestion about converting image <=> byte array on the Java side. It removes a lot of complexity. I don't know whether you have noticed or not, but the jint data type in JNI is a 32-bit integer type. I noticed it on my Mac but don't know about other operating systems.
>>>>>>>>
>>>>>>>> Leptonica is the image processing library for Tesseract [1].
>>>>>>>> What Tesseract does is use the image processing algorithms in Leptonica to implement its OCR algorithms. This [2] is the .cpp file responsible for creating the Tesseract API. You can see that it includes allheaders.h, which is the main header file of Leptonica. So I think it is a must to build Leptonica first and link it when we build Tesseract. This is not a big problem if we can use the static library of Leptonica (I did and it worked nicely). I think it is not an issue to use its static library because both Tesseract and Leptonica are under the Apache licence.
>>>>>>>>
>>>>>>>> I'm working on the Maven implementation you mentioned and will get back to you soon.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Dimuthu
>>>>>>>>
>>>>>>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>>>>>>>> [2] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>>>>>>>>
>>>>>>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Dimuthu,
>>>>>>>>>
>>>>>>>>> 1,2,3:
>>>>>>>>>
>>>>>>>>> Feel free to write your own Tesseract binding or port the existing code as you see fit. The JNI binding should be minimal, only the methods you require need to be wrapped. Also, don't forget that some of the interop can be done in Java, for example if it is easier to convert a BufferedImage to a byte array in Java then do it there and pass the result to JNI rather than writing lots of JNI C++ to achieve the same result.
>>>>>>>>>
>>>>>>>>> Your GitHub repo looks like a good start, I can make comments there as things progress.
>>>>>>>>>
>>>>>>>>> Is it possible to build Tesseract without leptonica?
>>>>>>>>> I was under the impression that it was used for image i/o only, but I may be misinformed.
>>>>>>>>>
>>>>>>>>> 4: The native platform library should be built as part of the Maven build for the Tesseract wrapper, which can be a separate project. The output can be a jar file which contains the native binaries. It should be possible for the jar to contain prebuilt binaries for all platforms but this is something we can worry about later. Right now the goal should be to build a jar containing just the current platform's native binary and any Java wrapper code.
>>>>>>>>>
>>>>>>>>> -- John
>>>>>>>>>
>>>>>>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi John,
>>>>>>>>>>
>>>>>>>>>> I tried to reuse that Android JNI wrapper for Tesseract. Here are my observations:
>>>>>>>>>>
>>>>>>>>>> 1. This wrapper depends heavily on the Android image libraries (android/bitmap.h). Most of the wrapper methods [1] use this library.
>>>>>>>>>>
>>>>>>>>>> 2. But I can understand the underlying logic in each function. Basically what it does is map the Tesseract API functions [2] to Java methods. In between it does some image <=> byte array conversions using the bitmap libraries in Android.
>>>>>>>>>>
>>>>>>>>>> 3. There are two ways. 1: We can port its code to make it compatible with our environments (Linux, Windows and Mac), which is really painful. It will also cause memory leaks. 2: We can use only its function signatures and implement them using our own code.
>>>>>>>>>>
>>>>>>>>>> I think the 2nd solution is better because we need only a few operations from the Tesseract library. I have created a GitHub repo [3] for this.
>>>>>>>>>> It's still not finished. I need to add some makefiles and build files to make it run properly. I also need to implement those wrapper functions [3]. This may take some time.
>>>>>>>>>>
>>>>>>>>>> 4. Because we are calling native libraries we need different builds of the Tesseract and Leptonica libraries for each platform (dll for Windows, so for Linux, dylib for Mac). So we may need to build those libraries at the time we build the PDFBox project. Or we can pre-build those libraries and add them to the project in .dll, .so or .dylib format. What is the preferred way?
>>>>>>>>>>
>>>>>>>>>> [1] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>>>>>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>>>>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>>>>>>>>> [4] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Dimuthu
>>>>>>>>>>
>>>>>>>>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I made the necessary changes to the document [1].
>>>>>>>>>>>
>>>>>>>>>>> For the last two days I had a deep look at this [2] JNI wrapper for the Tesseract API. Unfortunately it has been designed for the Android environment, so I think we need to write our own makefiles to build it into a dll (Windows) or dylib (Mac). Currently it has Android.mk files [3]. I'm searching for a way to convert them to a makefile that we can run on the console.
>>>>>>>>>>> Please suggest a better approach if you have one.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>>>>>>>>>>> [2] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>>>>>>>>>>> [3] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> This is a good start. However, there is no need for the Adder component, "Extracted Text (OCR)" can just feed back into the PDFBox "Text Extractor".
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe show a "PDF" file feeding in to "Text Extractor", to make it clear where the process starts.
>>>>>>>>>>>>
>>>>>>>>>>>> -- John
>>>>>>>>>>>>
>>>>>>>>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I should add that the OCR engine should be pluggable, so PDFToText might use an interface, e.g. OCREngine, and there will be a TesseractOCREngine class somewhere which provides the required functionality and lives in a separate jar file.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So do you need to embed those new functionalities into the existing PDFToText algorithms, or package them as a new subsystem (something like an API)?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>> From: "John Hewson" <[email protected]>
>>>>>>>>>>>>>>> Sent: 26/02/2014 07:38
>>>>>>>>>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>>>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and page rotation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs have corrupt encodings, which means the ASCII codes map to the wrong glyphs. We could OCR the glyphs to repair the encoding.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>>>>>> Let's say there is a PDF with both text in extractable format and some images with text (scanned images). In that case we first extract the extractable content using the PDFBox algorithms and the rest is extracted using OCR. Finally we pack both results together and give the output as PDFToText. Am I correct? What do you mean by "location data"?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1. What is called "glyphs" ?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 2. What is the main requirement of this project? As far as I understood, first we need to generate an image of malformed PDFs from PDFBox and then we need to process it using OCR for more accurate results. But the problem is, why shouldn't we directly do OCR on those PDFs without getting output from PDFBox? Correct me if I'm wrong.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text (PDFToText). The goal of this project is to enhance PDFToText so that it can use OCR to extract text from areas of the document where the text is embedded as an image. Such PDF files are typically generated by scanners or fax machines. There is also another case where OCR is useful: some fonts embedded in PDF files contain the wrong encoding, so when text is extracted with PDFToText the result is nonsense but when drawn with PDFToImage we see the correct letters.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Instead of:
>>>>>>>>>>>>>>>>> PDF => Image => OCR => Text
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We want to do:
>>>>>>>>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Ok fixed. This is what I did:
>>>>>>>>>>>>>>>>>>> Right click on the new project -> Debug As -> Debug Configurations -> Source -> Add -> Project
>>>>>>>>>>>>>>>>>>> Then I selected the PDFBox project.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'm using Eclipse. This is what I want. I created a new Java application project (say TestPDFBox) with a main class containing the following code.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>>>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>>>>>>>>>>>> document.addPage( blankPage );
>>>>>>>>>>>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>>>>>>>>>>>> document.close();
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Then I need to add the jar files generated in the target folder of PDFBox to the build path of my new project (I did build the PDFBox project from source). That is what I did. But let's say I need to check the functionality of the document.save("") method. I don't have a reference to its sources because I directly used the generated jars.
>>>>>>>>>>>>>>>>>>>> As Tilman said, I built PDFBox from sources, but I don't know a proper way to use it in other projects other than adding those jar files to the build path.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the PDFToText class (in pdfbox-tools) using your IDE and pass a PDF file path as the command line argument.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and managed to build it successfully. I looked at the classes you mentioned and got a rough idea of how they work. To check them I added the jars in the target folder to my separate Java project. I tried the samples in http://pdfbox.apache.org/cookbook/. I need to look further into the code, especially how those processXXX() methods work in the PDFTextStripper class. What I usually do is add some breakpoints and check them in the debug windows. But using jars it's not possible.
>>>>>>>>>>>>>>>>>>>>>> What approach do you follow for such a task?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I also installed Tesseract on my machine and managed to do some OCR stuff. That's a cool tool which works fine.
>>>>>>>>>>>>>>>>>>>>>> I'm still learning the code. If I get any issue I'll drop you a mail.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/; it contains a basic overview of the project and details on how to obtain the source code and build PDFBox for yourself.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details the only thoughts so far regarding it. Note that the OCR libraries mentioned in the JIRA issue are all under the Apache license, which is a requirement.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Once you have the source code, take a look at the PageDrawer class to see how text and images are rendered. We want someone to interface at a low level (e.g. one glyph, word, or sentence at a time) with an OCR engine. Also look at PDFTextStripper, which is how text is currently extracted, and take a look at how we have to go to great lengths to sort text back into reading order and infer the placement of diacritics - PDF is fundamentally a visual format, not a structured format like HTML - which is why extracting text can be so difficult sometimes.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>>>>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any questions.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the University of Moratuwa, Sri Lanka. I successfully completed my GSoC 2013 with the Apache ISIS [1] project. I'm very much interested in OCR and image processing stuff.
>>>>>>>>>>>>>>>>>>>>>>>> So I would like to select this project idea as my GSoC 2014 project because I feel it is the best suited project for me. At university we have also done some research in the OCR area, and our group wrote a literature review about increasing the efficiency of OCR systems (attached). Can you please suggest where to start learning about PDFBox?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> [1] http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
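
The byte-array hand-off described in the 11 Mar message (pixel data plus width, height, bytes per pixel and bytes per row) could look roughly like the sketch below on the Java side. The `ImageBytes` class and its field names are hypothetical and are not taken from the Tesseract-API repository. The sketch also assumes that repainting the page into a `TYPE_3BYTE_BGR` image is acceptable, which means the bytes come out in B, G, R order and the native side may need to reorder them.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;

// Hypothetical holder for the arguments of
// setImage(byte[] imagedata, int width, int height, int bpp, int bpl).
final class ImageBytes {
    final byte[] data;  // raw pixel bytes, row by row
    final int width;    // pixels per row
    final int height;   // number of rows
    final int bpp;      // bytes per pixel
    final int bpl;      // bytes per line (row stride)

    private ImageBytes(byte[] data, int width, int height, int bpp, int bpl) {
        this.data = data;
        this.width = width;
        this.height = height;
        this.bpp = bpp;
        this.bpl = bpl;
    }

    // Repaints any BufferedImage into a 3-bytes-per-pixel image so the
    // backing array has one known, tightly packed layout.
    static ImageBytes from(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage bgr = new BufferedImage(w, h, BufferedImage.TYPE_3BYTE_BGR);
        Graphics2D g = bgr.createGraphics();
        try {
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        // TYPE_3BYTE_BGR is backed by a single byte[] bank of w * h * 3 bytes.
        byte[] data = ((DataBufferByte) bgr.getRaster().getDataBuffer()).getData();
        return new ImageBytes(data, w, h, 3, 3 * w);
    }
}
```

With something like this, the call from the thread would become `api.setImage(img.data, img.width, img.height, img.bpp, img.bpl)` for an `ImageBytes img = ImageBytes.from(image)`, so no temporary file is written between PDFRenderer and Tesseract.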

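
The pluggable engine mentioned in the 27 Feb message could be sketched as below. Only the names `OCREngine` and `TesseractOCREngine` come from the thread; the `recognize` method, the `OcrWord` type and the `FixedOCREngine` stand-in are assumptions made for illustration, chosen to mirror the "recognised word followed by bounding box" output described on 11 Mar.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical result type: one recognised word and its bounding box
// in image coordinates.
final class OcrWord {
    final String text;
    final Rectangle box;

    OcrWord(String text, Rectangle box) {
        this.text = text;
        this.box = box;
    }
}

// Interface name from the thread; the method signature is an assumption.
interface OCREngine {
    List<OcrWord> recognize(BufferedImage page);
}

// Stand-in implementation so the sketch runs without native code.
// A TesseractOCREngine living in a separate jar would implement the
// same interface on top of the JNI wrapper.
final class FixedOCREngine implements OCREngine {
    @Override
    public List<OcrWord> recognize(BufferedImage page) {
        List<OcrWord> words = new ArrayList<>();
        words.add(new OcrWord("foo", new Rectangle(10, 15, 30, 12)));
        return words;
    }
}
```

Keeping the text extraction code dependent only on the interface means the Tesseract binding and its native binaries stay optional, which matches the separate-jar idea in the thread.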