Re: ignore header and footer

Eliot Kimber Thu, 18 Apr 2013 07:37:35 -0700

For a general solution you must know the geometric bounds of the header and
footer. You then compare the x/y location of each text string (that is, each
contiguous sequence of characters within the PDF data stream) to see whether
it falls inside or outside those bounds (with some heuristic for overlap).
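
As a rough illustration of that comparison, here is a minimal sketch in
Python. It is not tied to any particular PDF library: TextRun, Zone,
overlap_fraction and body_runs are hypothetical names, the coordinates are
assumed to be in points with the origin at the bottom-left of the page, and
the 0.5 overlap threshold is an arbitrary choice you would tune.

```python
from dataclasses import dataclass

@dataclass
class TextRun:
    """Hypothetical text string with its bounding box (origin bottom-left)."""
    text: str
    x: float       # left edge
    y: float       # bottom edge
    width: float
    height: float

@dataclass
class Zone:
    """Hypothetical rectangular header or footer region."""
    x0: float
    y0: float
    x1: float
    y1: float

def overlap_fraction(run, zone):
    """Return the share of the run's bounding-box area that lies inside the zone."""
    ix = max(0.0, min(run.x + run.width, zone.x1) - max(run.x, zone.x0))
    iy = max(0.0, min(run.y + run.height, zone.y1) - max(run.y, zone.y0))
    area = run.width * run.height
    return (ix * iy) / area if area > 0 else 0.0

def body_runs(runs, header, footer, threshold=0.5):
    """Keep runs whose overlap with both zones is below the threshold."""
    return [
        r for r in runs
        if overlap_fraction(r, header) < threshold
        and overlap_fraction(r, footer) < threshold
    ]

# Assumed US Letter page (612 x 792 points) with 50-point header and footer bands.
header = Zone(0, 742, 612, 792)
footer = Zone(0, 0, 612, 50)
runs = [
    TextRun("Chapter 1", 72, 760, 100, 10),      # fully inside the header
    TextRun("Body text", 72, 400, 100, 10),      # well inside the body
    TextRun("Low body line", 72, 48, 100, 10),   # only slightly overlaps the footer
    TextRun("Page 3", 300, 20, 40, 10),          # fully inside the footer
]
kept = [r.text for r in body_runs(runs, header, footer)]
```

The area-based overlap is only one possible heuristic; comparing just the
baseline y value against the zone edges may be enough for simple layouts.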

If the publications you're operating on are consistent in their page layout
then this can be relatively easy, but if they're not, you may need to have
humans do "zoning" on each page manually by some means (e.g., you build a
visual tool or use Acrobat to add boxes or whatever).

If the header and footer contents are consistent you may be able to
recognize them just by looking at the text content, but that depends on the
details of the documents you're processing.
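
One way that text-only recognition might look, as a sketch under stated
assumptions: each page has already been extracted into a list of lines,
headers and footers sit in the first or last line of the page, and they
differ between pages only in digits (page numbers, years). The function
names, the one-line edge and the 60% share are all hypothetical choices.

```python
import re
from collections import Counter

def normalize(line):
    """Collapse digit runs so 'Page 1' and 'Page 2' compare equal."""
    return re.sub(r"\d+", "#", line.strip())

def find_repeated_lines(pages, edge=1, min_share=0.6):
    """Return normalized edge lines that occur on at least min_share of the pages."""
    counts = Counter()
    for lines in pages:
        # A set ensures each candidate is counted at most once per page.
        candidates = {normalize(l) for l in lines[:edge] + lines[-edge:]}
        counts.update(candidates)
    needed = min_share * len(pages)
    return {text for text, n in counts.items() if n >= needed}

def strip_repeated(pages, edge=1, min_share=0.6):
    """Drop top/bottom lines of each page that match a repeated edge line."""
    repeated = find_repeated_lines(pages, edge, min_share)
    result = []
    for lines in pages:
        kept = [
            l for i, l in enumerate(lines)
            if not ((i < edge or i >= len(lines) - edge)
                    and normalize(l) in repeated)
        ]
        result.append(kept)
    return result

# Made-up sample pages, purely for illustration.
pages = [
    ["Annual Report 2013", "Intro text", "Page 1"],
    ["Annual Report 2013", "More text", "Page 2"],
    ["Annual Report 2013", "Final text", "Page 3"],
]
cleaned = strip_repeated(pages)
```

This will misfire if body text happens to repeat at the top or bottom of
most pages, so it is worth checking the detected lines before discarding
them.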

Cheers,

E.

On 4/18/13 9:19 AM, "rahul bhalla" <[email protected]> wrote:

> hello
> how can i ignore any header or footer of the pdfdoc while extracting text
> becoz when i extract text of the document it read footer as a next page
> content
> 
> --
> Regards
> Rahul Bhalla

-- 
Eliot Kimber
Senior Solutions Architect, RSI Content Solutions
"Bringing Strategy, Content, and Technology Together"
Main: 512.554.9368
www.rsicms.com
www.rsuitecms.com
Book: DITA For Practitioners, from XML Press,
http://xmlpress.net/publications/dita/practitioners-1/
