Re: Parsing Paragraphs from PDF.

Jeremy Arnold Wed, 23 Mar 2011 15:07:53 -0700

>You may also find the -layout parameter to be useful in helping retain
>approximate spacing on the page. If paragraphs are separated by space
>then you will get a blank line between paragraphs.

Is there any way to do something like this with PDFBox?

The spacing idea is one of the first approaches I attempted.
Unfortunately, I can't guarantee that 'Summary' will be alone on its
line of text: some PDFs have footnotes, so in the text rendition you
end up seeing something like this: Summary1,2,3,4

The PDFs are fairly regular: there is guaranteed to be a blank line
between the end of one paragraph and the 'Summary' header of the next
paragraph. Unfortunately, when I look at the text that PDFBox
extracts, I don't see that blank line.
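
For reference, if the extracted text did keep that blank line, the
"header to first blank line" rule could be sketched in plain Java along
these lines. This is only a sketch under assumptions: the class and
method names are made up for illustration, and the footnote format
(comma-separated digits right after 'Summary') is taken only from the
example above. It does not touch PDFBox at all; it only works on text
that has already been extracted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox or any other library.
public class SummaryExtractor {

    // Header: a line holding only "Summary", optionally followed by
    // footnote markers such as "1,2,3,4" (assumed format).
    // Body (group 1): everything up to the first blank line or end of input.
    private static final Pattern SUMMARY = Pattern.compile(
        "^[ \\t]*Summary(?:[ \\t]*\\d+(?:[ \\t]*,[ \\t]*\\d+)*)?[ \\t]*\\r?\\n"
        + "(.*?)(?=\\r?\\n[ \\t]*\\r?\\n|\\z)",
        Pattern.MULTILINE | Pattern.DOTALL);

    // Returns the body of every Summary paragraph found in the text.
    public static List<String> extractSummaries(String text) {
        List<String> summaries = new ArrayList<>();
        Matcher m = SUMMARY.matcher(text);
        while (m.find()) {
            summaries.add(m.group(1).trim());
        }
        return summaries;
    }

    public static void main(String[] args) {
        String text = "Intro text.\n\n"
            + "Summary1,2,3,4\n"
            + "First line of the summary\n"
            + "second line.\n\n"
            + "Next section.\n";
        for (String s : extractSummaries(text)) {
            System.out.println(s);
        }
    }
}
```

A line such as "Summary of results" is not treated as a header here,
because only digits, commas, and spaces are allowed after the word.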

As promising as poppler-utils seems, I need a Java solution: I am
going to be doing some real-time processing, and I don't think I can
rely on a multi-step process that runs poppler-utils and then a Perl
script, as much as I wish that were an option.

Thanks for the help.
Jeremy

On Wed, Mar 23, 2011 at 4:34 PM, Michael Howard <[email protected]> wrote:
> On Wed, Mar 23, 2011 at 4:58 PM, Jeremy Arnold
> <[email protected]> wrote:
> [snip]
>> Otherwise can anyone
>> recommend another way to go about grabbing specific paragraphs from a
>> PDF? I have a few thousand PDFs with a paragraph that has a header of
>> 'Summary'. I'd like to pull out the paragraphs associated with the
>
> I am not sure how regular your documents are but ...
>
> My first attempt would not involve using pdfbox.
>
> The first thing I would try would be using the pdftotext command line
> tool that is part of poppler-utils. This will not give you any font
> information. However, it will allow you to specify the region from
> which you would like to extract the text. For example, you can use
> this to eliminate headers + footers + sidebars.
>
> You may also find the -layout parameter to be useful in helping retain
> approximate spacing on the page. If paragraphs are separated by space
> then you will get a blank line between paragraphs.
>
> I would then take the output text and run it through perl regular
> expressions. If your target text begins with a text header that always
> says 'Summary' and is 1 paragraph long then it might be pretty easy to
> identify the target text as lying between 'Summary' and the first
> blank line.
>
> Good luck.
>
