> You may also find the -layout parameter to be useful in helping retain
> approximate spacing on the page. If paragraphs are separated by space
> then you will get a blank line between paragraphs.

Is there any way to do something like this with PDFBox? The spacing idea is one of the first approaches I attempted. Unfortunately, I can't guarantee that Summary will be alone on its line of text: some PDFs have footnotes, so in the text rendition you end up seeing something like this:

Summary1,2,3,4

The PDFs are fairly regular, in that there is guaranteed to be a blank line between the end of one paragraph and the Summary header for another paragraph. Unfortunately, when I look at the text that is parsed using PDFBox, I don't see that blank line.

As promising as poppler-utils seems, I need a Java solution, as I am going to be doing some real-time processing and I don't think I can rely on a multi-step process that runs poppler-utils and then a Perl script, as much as I wish that was an option.

Thanks for the help.

Jeremy

On Wed, Mar 23, 2011 at 4:34 PM, Michael Howard <[email protected]> wrote:
> On Wed, Mar 23, 2011 at 4:58 PM, Jeremy Arnold
> <[email protected]> wrote:
> [snip]
>> Otherwise can anyone
>> recommend another way to go about grabbing specific paragraphs from a
>> PDF? I have a few thousand PDFs with a paragraph that has a header of
>> 'Summary'. I'd like to pull out the paragraphs associated with the
>
> I am not sure how regular your documents are but ...
>
> My first attempt would not involve using pdfbox.
>
> The first thing I would try would be using the pdftotext command line
> tool that is part of poppler-utils. This will not give you any font
> information. However, it will allow you to specify the region from
> which you would like to extract the text. For example, you can use
> this to eliminate headers + footers + sidebars.
>
> You may also find the -layout parameter to be useful in helping retain
> approximate spacing on the page. If paragraphs are separated by space
> then you will get a blank line between paragraphs.
>
> I would then take the output text and run it through perl regular
> expressions. If your target text begins with a text header that always
> says 'Summary' and is 1 paragraph long then it might be pretty easy to
> identify the target text as lying between 'Summary' and the first
> blank line.
>
> Good luck.
>
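
For illustration only, the "between 'Summary' and the first blank line" idea from the quoted message could be sketched in plain Java rather than Perl. This is a minimal sketch under several assumptions: the text has already been extracted from the PDF by some tool, the extracted text does keep the blank line after the paragraph (which is exactly what is reported missing in the PDFBox output above), and footnote markers appear as comma-separated digits glued to the header, as in "Summary1,2,3,4". The class name SummaryExtractor and the method extractSummary are hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SummaryExtractor {

    // Assumed header shape: "Summary", optionally followed by footnote
    // markers such as "1,2,3,4", and nothing else on the line.
    private static final Pattern HEADER =
            Pattern.compile("^\\s*Summary\\s*(\\d+(\\s*,\\s*\\d+)*)?\\s*$");

    /**
     * Returns the lines between the first Summary header and the next
     * blank line, joined with newlines, or null if no header is found.
     */
    public static String extractSummary(String text) {
        String[] lines = text.split("\\r?\\n", -1);
        List<String> body = new ArrayList<>();
        boolean inSummary = false;

        for (String line : lines) {
            if (!inSummary) {
                // Still looking for the header line.
                if (HEADER.matcher(line).matches()) {
                    inSummary = true;
                }
                continue;
            }
            if (line.trim().isEmpty()) {
                if (body.isEmpty()) {
                    continue; // blank lines between header and paragraph
                }
                break; // first blank line after the paragraph ends it
            }
            body.add(line.trim());
        }
        return inSummary ? String.join("\n", body) : null;
    }

    public static void main(String[] args) {
        String sample = "Some earlier paragraph.\n\n"
                + "Summary1,2,3,4\n"
                + "First line of the summary.\n"
                + "Second line of the summary.\n\n"
                + "Next header\n";
        System.out.println(extractSummary(sample));
    }
}
```

The header pattern is deliberately strict, so a body sentence that merely starts with the word "Summary" is not mistaken for a header. If the extraction step drops the blank lines, this approach has nothing to stop on and would need a different paragraph signal.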

