Please see: https://issues.apache.org/jira/browse/PDFBOX-521
for an improved PDFTextStripper2 class that provides a little more instrumentation of the text parsing than the current PDFTextStripper class included in the distro.

-----Original Message-----
From: Navendu Garg [mailto:[email protected]]
Sent: Tuesday, September 15, 2009 11:52 AM
To: [email protected]
Subject: Re: PDF Region Parsing

Andy,

An easy way to extract Chapter 3 and Chapter 5 is to extract text page by page using the org.apache.pdfbox.util.PDFTextStripper class and, if possible, use regular expressions to determine the starting and ending pages of each chapter.

I personally like the idea of extending the PDFTextStripper class and using its protected methods:

    writeCharacters (to extract character-by-character information)
    startPage
    endPage
    processLineSeparator

These are called whenever the corresponding event occurs: the start of a page (startPage), the end of a page (endPage), or a line separator (processLineSeparator). All the information pertaining to a character is populated in a TextPosition instance, which is passed as a parameter to the writeCharacters method.

Sometimes PDFs come with bookmarks, which may give you the starting page of each chapter. If so, your task becomes easier.

    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDDocumentOutline;
    import org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem;

    // Bookmark (page number + title) and getPageNumber are application
    // helpers not shown here; getPageNumber could be, for example,
    // allPages.indexOf(page) + 1.

    PDDocument pdfDoc = PDDocument.load(new File("abc.pdf"));
    PDDocumentCatalog catalog = pdfDoc.getDocumentCatalog();
    List allPages = catalog.getAllPages();

    // Extract PDF bookmarks.
    List<Bookmark> bookmarks = new ArrayList<Bookmark>();
    PDDocumentOutline outline = catalog.getDocumentOutline();
    if (outline != null) {
        extractBookmarks(allPages, outline.getFirstChild(), bookmarks, pdfDoc);
    }

    // Walks the outline depth-first: each item, then its children, then its siblings.
    private void extractBookmarks(List<PDPage> allPages, PDOutlineItem bookmark,
            List<Bookmark> bookmarkList, PDDocument document) throws IOException {
        while (bookmark != null) {
            String title = bookmark.getTitle();
            PDPage page = bookmark.findDestinationPage(document);
            int pageNumber = getPageNumber(page);
            bookmarkList.add(new Bookmark(pageNumber, title));
            PDOutlineItem child = bookmark.getFirstChild();
            extractBookmarks(allPages, child, bookmarkList, document);
            bookmark = bookmark.getNextSibling();
        }
    }

As for deciding whether PDFBox is right for you, it depends on your goals. I have found PDFBox to be very useful for text extraction. It has a rich API based on the Adobe PDF Reference (http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf). So if there is something specific you want to extract, you can follow the instructions in the PDF Reference and use the PDFBox library to extract that information.

For example, you can sometimes add printed page numbers to a PDF using Adobe Acrobat Professional (Appendix A, A-1, i, ii, iii etc.). Currently, the PDFBox API does not provide a method to extract this information, which maps printed page labels to the default page sequence in the PDF document. However, I was able to extract it by going through the PDF Reference and using the appropriate PDFBox data structures. I am planning to submit this as a patch soon.

That being said, the library certainly has something of a learning curve.

Hope this helps.

Navendu Garg

On Tue, Sep 15, 2009 at 10:22 AM, [email protected] <[email protected]> wrote:
> Hi all,
>
> I am new to PDFBox and want to ask a few questions to make sure that PDFBox
> is the right choice for what I want to do.
>
> Is there a way to use PDFBox so that I am able to extract specific portions
> of a PDF document - for example "Chapter 3" and "Chapter 5" from an e-book?
> What kind of auxiliary information will PDFBox help me with? I think it will
> be appropriate to use font features (size, bold/not-bold etc.). Is PDFBox
> the right choice and how easy or difficult is it going to be?
>
> Thanks in advance.
> Andy
>
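
For reference, the "extract page by page, then use regular expressions to find where each chapter starts and ends" idea from the reply can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from PDFBox or from either poster: the class ChapterRanges, its Range type, and the heading pattern are all hypothetical, and the sketch assumes the text of each page has already been extracted into a list (with PDFTextStripper this could be done by calling setStartPage(n) and setEndPage(n) before getText(document) for each page n). It also assumes every chapter opens with a line such as "Chapter 3" at the very top of its first page, which will not hold for every e-book.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: maps chapter numbers to page ranges using per-page text.
class ChapterRanges {

    /** Inclusive, 1-based page range of a single chapter. */
    static final class Range {
        public final int startPage;
        public final int endPage;

        Range(int startPage, int endPage) {
            this.startPage = startPage;
            this.endPage = endPage;
        }

        @Override
        public String toString() {
            return "pages " + startPage + "-" + endPage;
        }
    }

    // Assumption: a chapter's first page begins with "Chapter <number>".
    // Without the MULTILINE flag, ^ only matches at the start of the page text.
    private static final Pattern CHAPTER_HEADING =
            Pattern.compile("^\\s*Chapter\\s+(\\d{1,4})\\b", Pattern.CASE_INSENSITIVE);

    /**
     * pageTexts.get(0) is the text of page 1, and so on.
     * A chapter ends on the page before the next detected chapter heading;
     * the last detected chapter runs to the final page.
     */
    static Map<Integer, Range> findChapters(List<String> pageTexts) {
        List<Integer> chapterNumbers = new ArrayList<>();
        List<Integer> startPages = new ArrayList<>();

        for (int i = 0; i < pageTexts.size(); i++) {
            Matcher m = CHAPTER_HEADING.matcher(pageTexts.get(i));
            if (m.find()) {
                chapterNumbers.add(Integer.parseInt(m.group(1)));
                startPages.add(i + 1); // convert to 1-based page number
            }
        }

        Map<Integer, Range> ranges = new LinkedHashMap<>();
        for (int k = 0; k < startPages.size(); k++) {
            int end = (k + 1 < startPages.size())
                    ? startPages.get(k + 1) - 1
                    : pageTexts.size();
            ranges.put(chapterNumbers.get(k), new Range(startPages.get(k), end));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Stand-in page texts; real input would come from the text stripper.
        List<String> pages = Arrays.asList(
                "Preface\nSome introductory words.",
                "Chapter 1\nGetting started.",
                "More of the first chapter.",
                "Chapter 2\nGoing further.",
                "Chapter 3\nThe part we want.",
                "Still chapter three.");
        Map<Integer, Range> ranges = findChapters(pages);
        System.out.println("Chapter 3: " + ranges.get(3));
    }
}
```

Once the range for a chapter is known, the same start and end page numbers could be passed to the text stripper to extract just that chapter. Where the document has bookmarks, their page numbers can replace the regular-expression step, with each chapter again ending on the page before the next bookmark.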
