I'm using Eclipse. Here is what I want to do. I created a new Java
application project (say TestPDFBox) with a main class containing the
following code:
    PDDocument document = new PDDocument();
    PDPage blankPage = new PDPage();
    document.addPage(blankPage);
    document.save("BlankPage.pdf");
    document.close();
Then I need to add the jar files generated in the target folder of PDFBox to
the build path of my new project (I built the PDFBox project from source).
That is what I did. But let's say I need to check the functionality of the
document.save("") method. I don't have a reference to its sources, because I
used the generated jars directly. As Tilman said, I built PDFBox from source,
but I don't know a proper way to use it in other projects other than adding
those jar files to the build path.
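
One small check that may help while sorting this out: the sketch below
prints which jar or class folder a class was loaded from, so it is possible
to confirm whether Eclipse resolves PDDocument from the jars copied out of
target/ or from somewhere else. ClassOrigin and its two methods are
hypothetical helper names of my own, not part of PDFBox, and the lookup is
done by name only so that the file compiles even without PDFBox on the
classpath.

```java
import java.security.CodeSource;

// Hypothetical helper (not part of PDFBox): reports where a class was loaded from.
class ClassOrigin {

    // Returns the jar or class folder that supplied the class, or a note when
    // the JVM exposes no code source (typical for core classes such as String).
    static String describe(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return type.getName() + " <- (no code source, probably a JDK class)";
        }
        return type.getName() + " <- " + source.getLocation();
    }

    // Looks the class up by name, so no compile-time dependency is needed.
    static String describeByName(String className) {
        try {
            return describe(Class.forName(className));
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " <- (not on the classpath)";
        }
    }

    public static void main(String[] args) {
        // Assumption: PDDocument is the class of interest; pass another name to override.
        String name = args.length > 0 ? args[0] : "org.apache.pdfbox.pdmodel.PDDocument";
        System.out.println(describeByName(name));
    }
}
```

Run from inside the TestPDFBox project, the printed location should point
either at a .jar file or at a class output folder, which shows what the
debugger is actually stepping into.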
On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <[email protected]> wrote:
> Which IDE are you using? You should be able to run the PDFToText class (in
> pdfbox-tools) using your IDE and pass a PDF file path as the command line
> argument.
>
> -- John
>
> > On 24 Feb 2014, at 22:38, DImuthu Upeksha <[email protected]> wrote:
> >
> > Hi John,
> > Thanks for the reply. Yes, I checked out the PDFBox code and managed to
> > build it successfully. I looked at the classes you mentioned and got a
> > rough idea of how they work. To check them, I added the jars in the
> > target folder to my separate Java project. I tried the samples in
> > http://pdfbox.apache.org/cookbook/. I need to look further into the code,
> > especially how the processXXX() methods work in the PDFTextStripper class.
> > What I usually do is add some breakpoints and inspect them in the debug
> > windows. But with the jars that's not possible. What approach do you
> > follow for such a task?
> >
> > I also installed Tesseract on my machine and managed to do some OCR with
> > it. It's a cool tool that works fine.
> > I'm still learning the code. If I run into any issue I'll drop you a mail.
> >
> > Thanks
> > Dimuthu
> >
> >
> >> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <[email protected]> wrote:
> >>
> >> Hi Dimuthu
> >>
> >> The PDFBox website can be found at http://pdfbox.apache.org/. It contains
> >> a basic overview of the project and details on how to obtain the source
> >> code and build PDFBox for yourself.
> >>
> >> Currently we do not perform any OCR and PDFBOX-1912 details the only
> >> thoughts so far regarding it. Note that the OCR libraries mentioned in
> >> the JIRA issue are all under the Apache license, which is a requirement.
> >>
> >> Once you have the source code, take a look at the PageDrawer class to see
> >> how text and images are rendered. We want someone to interface at a low
> >> level (e.g. one glyph, word, or sentence at a time) with an OCR engine.
> >> Also look at PDFTextStripper, which is how text is currently extracted.
> >> Note how we have to go to great lengths to sort text back into reading
> >> order and infer the placement of diacritics. PDF is fundamentally a
> >> visual format, not a structured format like HTML, which is why extracting
> >> text can be so difficult sometimes.
> >>
> >> The full PDF Reference document can be found at:
> >>
> >> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
> >>
> >> Feel free to discuss specifics of your proposal or ask any questions.
> >>
> >> Thanks,
> >>
> >> -- John
> >>
> >> On 23 Feb 2014, at 21:13, DImuthu Upeksha <[email protected]>
> >> wrote:
> >>
> >>> Hi,
> >>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
> >>> University of Moratuwa, Sri Lanka. I successfully completed GSoC 2013
> >>> with the Apache ISIS [1] project. I'm very interested in OCR and image
> >>> processing, so I would like to select this project idea as my GSoC 2014
> >>> project because I feel it is the best-suited project for me. At
> >>> university we have also done some research in the OCR area, and our
> >>> group wrote a literature review on increasing the efficiency of OCR
> >>> systems (attached). Can you please suggest where I should start learning
> >>> about PDFBox?
> >>>
> >>> [1]
> >>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
> >>>
> >>> Thank you
> >>> Dimuthu
> >>>
> >>> --
> >>> Regards
> >>> W.Dimuthu Upeksha
> >>> Undergraduate
> >>> Department of Computer Science And Engineering
> >>> University of Moratuwa, Sri Lanka
> >
> >
> > --
> > Regards
> >
> > W.Dimuthu Upeksha
> > Undergraduate
> > Department of Computer Science And Engineering
> >
> > University of Moratuwa, Sri Lanka
>
--
Regards
W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering
University of Moratuwa, Sri Lanka