For a general solution you must know the geometric bounds of the header and footer. You then compare the x/y location of each text string (that is, each contiguous sequence of characters within the PDF data stream) to see whether it falls inside or outside those bounds (with some heuristic for strings that partially overlap them).
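As a rough illustration of that bounds check, here is a minimal Python sketch. It assumes you have already pulled each text string and its bounding box out of the PDF with whatever library you use; the `TextString` and `Zone` types, the `HEADER` and `FOOTER` values (a US Letter page with 50-point bands), and the 0.5 overlap threshold are all made up for the example, not taken from any particular PDF API.

```python
from dataclasses import dataclass


@dataclass
class TextString:
    """Hypothetical record for one extracted text string and its bounding box."""
    text: str
    x: float       # left edge, in PDF user-space units
    y: float       # bottom edge (origin assumed at the bottom-left of the page)
    width: float
    height: float


@dataclass
class Zone:
    """Hypothetical rectangle marking a header or footer region."""
    x0: float
    y0: float
    x1: float
    y1: float


# Assumed layout: 612 x 792 page with 50-unit header and footer bands.
HEADER = Zone(0, 742, 612, 792)
FOOTER = Zone(0, 0, 612, 50)


def overlap_fraction(s: TextString, zone: Zone) -> float:
    """Return the fraction of the string's box area that lies inside the zone."""
    ix = max(0.0, min(s.x + s.width, zone.x1) - max(s.x, zone.x0))
    iy = max(0.0, min(s.y + s.height, zone.y1) - max(s.y, zone.y0))
    area = s.width * s.height
    return (ix * iy) / area if area > 0 else 0.0


def body_text(strings, zones, threshold=0.5):
    """Keep strings whose overlap with every zone is below the threshold.

    The threshold is the overlap heuristic: a string that only clips the
    edge of a zone is still treated as body text.
    """
    return [
        s for s in strings
        if all(overlap_fraction(s, z) < threshold for z in zones)
    ]
```

With these assumed zones, a string sitting at y=760 would be dropped as header, one at y=30 would be dropped as footer, and one that only pokes a few units into the footer band would be kept. You would tune the zones and the threshold to the actual page layout.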
If the publications you're operating on are consistent in their page layout then this can be relatively easy, but if they're not, you may need to have humans do "zoning" on each page manually by some means (e.g., you build a visual tool or use Acrobat to add boxes or whatever). If the header and footer contents are consistent you may be able to recognize them just by looking at the text content, but that depends on the details of the documents you're processing.

Cheers,

E.

On 4/18/13 9:19 AM, "rahul bhalla" <[email protected]> wrote:

> hello
> how can i ignore any header or footer of the pdfdoc while extracting text
> becoz when i extract text of the document it read footer as a next page
> content
>
> --
> Regards
> Rahul Bhalla

--
Eliot Kimber
Senior Solutions Architect, RSI Content Solutions
"Bringing Strategy, Content, and Technology Together"
Main: 512.554.9368
www.rsicms.com
www.rsuitecms.com
Book: DITA For Practitioners, from XML Press,
http://xmlpress.net/publications/dita/practitioners-1/

