What you actually /want/ is virtually impossible in the general case.
It may be possible for your specific PDFs, but we can't know that unless
we actually see one.


There is a "positional text extraction" RenderListener:
LocationTextExtractionStrategy.  It groups text by orientation, and then
by reading order... IIRC.  Pretty much, yeah:

 * This renderer keeps track of the orientation and distance (both perpendicular
 * and parallel) to the unit vector of the orientation.  Text is ordered by
 * orientation, then perpendicular, then parallel distance.  Text with the same
 * perpendicular distance, but different parallel distance is treated as being on
 * the same line.
 * <br>
 * This renderer also uses a simple strategy based on the font metrics to determine if
 * a blank space should be inserted into the output.
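
To make that ordering rule concrete, here is a rough standalone sketch of the idea. It is not iText's code: the Chunk class, the whole-degree and whole-unit rounding, and the "always put one space between chunks on the same line" rule are all simplifying assumptions of mine (the real strategy looks at font metrics to decide on spaces).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ReadingOrderSketch {

    /** Hypothetical stand-in for a positioned piece of text; not an iText class. */
    static final class Chunk {
        final String text;
        final int orientation;   // baseline angle, rounded to whole degrees
        final int perpendicular; // signed distance of the baseline from the origin
        final double parallel;   // distance along the baseline direction

        Chunk(String text, double x, double y, double angleDegrees) {
            double rad = Math.toRadians(angleDegrees);
            double ux = Math.cos(rad); // unit vector of the orientation
            double uy = Math.sin(rad);
            this.text = text;
            this.orientation = (int) Math.round(angleDegrees);
            // z-component of (x, y) cross (ux, uy); for horizontal text this is -y,
            // so a higher baseline sorts first
            this.perpendicular = (int) Math.round(x * uy - y * ux);
            // dot product of (x, y) with the unit vector; for horizontal text this is x
            this.parallel = x * ux + y * uy;
        }
    }

    /** Sorts by orientation, then perpendicular, then parallel distance, and joins the text. */
    static String order(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingInt((Chunk c) -> c.orientation)
                .thenComparingInt(c -> c.perpendicular)
                .thenComparingDouble(c -> c.parallel));
        StringBuilder out = new StringBuilder();
        Chunk previous = null;
        for (Chunk c : sorted) {
            if (previous != null) {
                // Same orientation and same perpendicular distance: same line
                boolean sameLine = previous.orientation == c.orientation
                        && previous.perpendicular == c.perpendicular;
                out.append(sameLine ? ' ' : '\n');
            }
            out.append(c.text);
            previous = c;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Chunks arrive in content-stream order, which need not be reading order
        List<Chunk> page = new ArrayList<>();
        page.add(new Chunk("footer", 72, 30, 0));
        page.add(new Chunk("world", 120, 700, 0));
        page.add(new Chunk("Header", 72, 760, 0));
        page.add(new Chunk("Hello", 72, 700, 0));
        System.out.println(order(page));
    }
}
```

Note that the header and footer simply land at the top and bottom of the output because of where they sit on the page. Nothing in the ordering knows that they are a header or a footer.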

It won't separate the header and footer from the body, but it's probably
your best bet.
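
If your particular PDFs happen to keep the header and footer inside fixed vertical bands, you could try sorting lines into buckets by baseline position after extraction. The sketch below is only that, a sketch: the Line class, the Band enum, and the two threshold values are hypothetical, nothing here is iText API, and it does nothing about watermarks that overlap the body.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

class PageBandSketch {

    enum Band { HEADER, BODY, FOOTER }

    /** Hypothetical extracted line: its text plus the y coordinate of its baseline. */
    static final class Line {
        final String text;
        final double baselineY;

        Line(String text, double baselineY) {
            this.text = text;
            this.baselineY = baselineY;
        }
    }

    /** PDF y coordinates grow upward, so the header band has the larger y values. */
    static Band classify(double baselineY, double headerMinY, double footerMaxY) {
        if (baselineY >= headerMinY) {
            return Band.HEADER;
        }
        if (baselineY <= footerMaxY) {
            return Band.FOOTER;
        }
        return Band.BODY;
    }

    /** Groups line text by band, preserving the incoming order within each band. */
    static Map<Band, List<String>> split(List<Line> lines, double headerMinY, double footerMaxY) {
        Map<Band, List<String>> result = new EnumMap<>(Band.class);
        for (Band band : Band.values()) {
            result.put(band, new ArrayList<>());
        }
        for (Line line : lines) {
            result.get(classify(line.baselineY, headerMinY, footerMaxY)).add(line.text);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed thresholds for a 792-point-tall page; these must be tuned per document
        double headerMinY = 740;
        double footerMaxY = 50;
        List<Line> lines = new ArrayList<>();
        lines.add(new Line("Annual Report", 760));
        lines.add(new Line("Body text", 500));
        lines.add(new Line("Page 3", 30));
        System.out.println(split(lines, headerMinY, footerMaxY));
    }
}
```

Whether fixed bands like these exist at all depends entirely on how your files were produced, which is why we would need to see one.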

--Mark Storer
  Senior Software Engineer
  Cardiff.com
 
import legalese.Disclaimer;
Disclaimer<Cardiff> DisCard = null;
 
 

> -----Original Message-----
> From: DivyaKambhatla [mailto:[email protected]] 
> Sent: Wednesday, February 16, 2011 5:40 AM
> To: [email protected]
> Subject: [iText-questions] Split a PDF Page into header , 
> footer and body.
> 
> 
> Hi,
> 
>    Could anyone please let me know if it is possible via 
> iText5.0.5 to split a PDF Page into its header, footer, body 
> and watermark sections and access each content separately. I 
> am dealing with both watermarked and non-watermarked PDFs.
> 
>         When I extract the content from a PDF using 
> iText5.0.5, the order in which the extraction happens is as follows:
> 
>               1. Watermark gets extracted first (if it exists)
>               2. Page Text Content gets extracted next
>               3. The titles of any figures that are present 
> in the PDF Page.
>               4.  Footer Content gets extracted.
>               5. Header content gets extracted last.
> 
>    Is there any way to extract the text such that the complete 
> PDF body content is extracted first and the remaining 
> content, such as watermarks, headers, and footers, is extracted 
> next, so that the order of the extracted text is not lost?
> 
> Thanks,
> Divya.
>  
> --
> View this message in context: 
> http://itext-general.2136553.n4.nabble.com/Split-a-PDF-Page-into-header-footer-and-body-tp3308836p3308836.html
> Sent from the iText - General mailing list archive at Nabble.com.
> 
> --------------------------------------------------------------
> ----------------
> The ultimate all-in-one performance toolkit: Intel(R) 
> Parallel Studio XE:
> Pinpoint memory and threading errors before they happen.
> Find and fix more than 250 security defects in the development cycle.
> Locate bottlenecks in serial and parallel code that limit performance.
> http://p.sf.net/sfu/intel-dev2devfeb
> _______________________________________________
> iText-questions mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/itext-questions
> 
> Many questions posted to this list can (and will) be answered 
> with a reference to the iText book: 
> http://www.itextpdf.com/book/ Please check the keywords list 
> before you ask for examples: http://itextpdf.com/themes/keywords.php
> 
> 

