Re: PDF Region Parsing

Navendu Garg Tue, 15 Sep 2009 08:52:23 -0700

Andy,

An easy way to extract Chapter 3 and Chapter 5 is to extract the text
page by page using the org.apache.pdfbox.util.PDFTextStripper class
and, if possible, use regular expressions to determine the starting
and ending pages of each chapter. I personally like the idea of
extending the PDFTextStripper class and overriding its protected
methods

writeCharacters (To extract character by character information)
startPage
endPage
processLineSeparator

which are called whenever the corresponding event occurs: the start
of a page (startPage), the end of a page (endPage), or a line
separator (processLineSeparator). All the information pertaining to a
character is populated in a TextPosition instance, which is passed as
a parameter to the writeCharacters method.
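
To make the regular-expression idea concrete, here is a minimal sketch
in plain Java. It assumes you have already collected the text of each
page into a List<String> (for example with PDFTextStripper, one page
at a time) and that chapter headings look like "Chapter 3" at the
start of a line. The class name ChapterFinder and both method names
are made up for illustration; they are not part of PDFBox.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
class ChapterFinder {
    // Assumption: a heading is a line that starts with "Chapter <number>".
    private static final Pattern HEADING = Pattern.compile(
            "^\\s*Chapter\\s+(\\d+)\\b",
            Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);

    /** Maps each chapter number to the 1-based page where its heading first appears. */
    static Map<Integer, Integer> findChapterStarts(List<String> pageTexts) {
        Map<Integer, Integer> starts = new LinkedHashMap<Integer, Integer>();
        for (int i = 0; i < pageTexts.size(); i++) {
            Matcher m = HEADING.matcher(pageTexts.get(i));
            while (m.find()) {
                int chapter = Integer.parseInt(m.group(1));
                if (!starts.containsKey(chapter)) {
                    starts.put(chapter, i + 1);
                }
            }
        }
        return starts;
    }

    /**
     * Returns {startPage, endPage} (1-based, inclusive) for a chapter, or
     * null if no heading was found. The chapter is taken to end on the
     * page before the next heading that starts on a later page.
     */
    static int[] pageRange(Map<Integer, Integer> starts, int chapter, int totalPages) {
        Integer start = starts.get(chapter);
        if (start == null) {
            return null;
        }
        int end = totalPages;
        for (int other : starts.values()) {
            if (other > start && other - 1 < end) {
                end = other - 1;
            }
        }
        return new int[] { start, end };
    }
}
```

Note that a table of contents page listing "Chapter 3 ..." would match
as well, so in practice you may need to skip the front matter or also
check the font size reported in TextPosition.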

Sometimes PDFs come with bookmarks, which may give you the starting
page of each chapter. If so, your task becomes easier.

        PDDocument pdfDoc = PDDocument.load(new File("abc.pdf"));
        PDDocumentCatalog catalog = pdfDoc.getDocumentCatalog();
        List allPages = catalog.getAllPages();

        // Extract PDF bookmarks.
        List<Bookmark> bookmarks = new ArrayList<Bookmark>();
        PDDocumentOutline outline = catalog.getDocumentOutline();
        if (outline != null) {
                extractBookmarks(allPages, outline.getFirstChild(),
                                bookmarks, pdfDoc);
        }

        // Bookmark and getPageNumber() are helpers that are not shown here.
        private void extractBookmarks(List<PDPage> allPages,
                        PDOutlineItem bookmark, List<Bookmark> bookmarkList,
                        PDDocument document) throws IOException {
                // Walk the siblings at this level; for each one, record it
                // and then recurse into its children (depth-first order).
                while (bookmark != null) {
                        String title = bookmark.getTitle();
                        PDPage page = bookmark.findDestinationPage(document);
                        int pageNumber = getPageNumber(page);
                        bookmarkList.add(new Bookmark(pageNumber, title));
                        PDOutlineItem child = bookmark.getFirstChild();
                        extractBookmarks(allPages, child, bookmarkList,
                                        document);
                        bookmark = bookmark.getNextSibling();
                }
        }
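
Once you have the flattened bookmark list, turning it into a page
range for a chapter could look roughly like the sketch below. The
Bookmark class here is only a guess at the shape of the helper used
above (a 1-based page number plus a title), and ChapterRanges and
chapterRange are names invented for this example. It also assumes the
chapter-level bookmarks are titled "Chapter <number> ...".

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Assumed shape of the Bookmark helper; the original class is not shown.
class Bookmark {
    final int pageNumber; // 1-based page on which the bookmarked section starts
    final String title;

    Bookmark(int pageNumber, String title) {
        this.pageNumber = pageNumber;
        this.title = title;
    }
}

// Hypothetical helper, not part of PDFBox.
class ChapterRanges {
    // Assumption: chapter-level bookmark titles begin with "Chapter <number>".
    private static final Pattern CHAPTER = Pattern.compile(
            "^\\s*Chapter\\s+(\\d+)\\b", Pattern.CASE_INSENSITIVE);

    /**
     * Returns {startPage, endPage} (1-based, inclusive) for the given
     * chapter, or null if no matching bookmark exists. The list is
     * expected in document order, as produced by a depth-first walk.
     */
    static int[] chapterRange(List<Bookmark> bookmarks, int chapter, int totalPages) {
        int start = -1;
        for (Bookmark b : bookmarks) {
            if (b.title == null) {
                continue;
            }
            Matcher m = CHAPTER.matcher(b.title);
            if (!m.find()) {
                continue; // sub-section or unrelated bookmark
            }
            if (start < 0) {
                if (Integer.parseInt(m.group(1)) == chapter) {
                    start = b.pageNumber;
                }
            } else {
                // The next chapter-level bookmark closes the range.
                return new int[] { start, Math.max(start, b.pageNumber - 1) };
            }
        }
        return start < 0 ? null : new int[] { start, totalPages };
    }
}
```

With that range in hand, you can restrict text extraction to those
pages only.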

Whether PDFBox is right for you depends on your goals. I have found
PDFBox to be very useful for text extraction. It has a rich API based
on the Adobe PDF Reference
(http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf). So if
there is something specific you want to extract from a PDF, you can
follow the PDF Reference and use the PDFBox library to extract that
information. For example, you can add printed page numbers to a PDF
using Adobe Acrobat Professional (Appendix A, A-1, i, ii, iii, etc.).
Currently, the PDFBox API does not provide a method to extract this
information, which maps printed page labels to the default page
sequence in the PDF document. However, I was able to extract it by
going through the PDF Reference and using the appropriate PDFBox data
structures. I am planning to submit this as a patch soon. That being
said, the library certainly has a bit of a learning curve.
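
To illustrate what the page label mapping involves, here is a rough
sketch of the labelling rule as the PDF Reference describes it: each
labelling range starts at a 0-based page index and has a numbering
style, an optional prefix, and a start value. This is not the patch
mentioned above and it does not read anything from a PDF; the ranges
are filled in by hand, and LabelRange and PageLabels are names made
up for the example. Only the decimal and roman styles are handled.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical holder for one labelling range (style, prefix, start value).
class LabelRange {
    final char style;    // 'D' decimal, 'R' upper roman, 'r' lower roman
    final String prefix; // label prefix, "" if none
    final int start;     // numeric value of the first page in the range

    LabelRange(char style, String prefix, int start) {
        this.style = style;
        this.prefix = prefix;
        this.start = start;
    }
}

// Hypothetical helper, not part of PDFBox.
class PageLabels {
    /** Returns the printed label for a 0-based page index. */
    static String labelFor(TreeMap<Integer, LabelRange> ranges, int pageIndex) {
        Map.Entry<Integer, LabelRange> entry = ranges.floorEntry(pageIndex);
        if (entry == null) {
            // Assumption: fall back to plain 1-based numbering.
            return String.valueOf(pageIndex + 1);
        }
        LabelRange r = entry.getValue();
        int n = r.start + (pageIndex - entry.getKey());
        String number;
        switch (r.style) {
            case 'D': number = String.valueOf(n); break;
            case 'R': number = roman(n); break;
            case 'r': number = roman(n).toLowerCase(); break;
            // Letter styles ('A', 'a') are left out of this sketch;
            // any other style yields the prefix alone.
            default:  number = ""; break;
        }
        return r.prefix + number;
    }

    static String roman(int n) {
        int[] values = {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
        String[] symbols = {"M", "CM", "D", "CD", "C", "XC", "L", "XL",
                            "X", "IX", "V", "IV", "I"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            while (n >= values[i]) {
                sb.append(symbols[i]);
                n -= values[i];
            }
        }
        return sb.toString();
    }
}
```

For example, a document whose first four pages are numbered i to iv,
followed by pages 1 to 3 and then an appendix starting at A-8, would
use three ranges keyed at page indexes 0, 4 and 7.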

Hope this helps.

Navendu Garg

On Tue, Sep 15, 2009 at 10:22 AM, [email protected]
<[email protected]> wrote:
> Hi all,
>
> I am new to PDFBox and want to ask a few questions to make sure that PDFBox
> is the right choice for what I want to do.
>
> Is there a way to use PDFBox so that I am able to extract specific portions
> of a PDF document - for example "Chapter 3" and "Chapter 5" from an e-book?
> What kind of auxiliary information will PDFBox help me with? I think it will
> be appropriate to use font features (size, bold/not-bold etc.). Is PDFBox
> the right choice and how easy or difficult is it going to be?
>
> Thanks in advance.
> Andy
>
