pdf from scanner and header/footer/2-column cases

Filippis, Ioannis Wed, 14 Aug 2013 05:01:20 -0700

Hi all,

I wonder whether it is possible to detect whether an input PDF is a proper
PDF (containing text) or one produced by scanner software (containing only
images). In the second case I could, for example, run OCR software such as
tesseract/tess4j. Looking at Metadata.CONTENT_TYPE doesn't help, as it
is always "application/pdf".


I also wonder how I could remove header/footer text, or even table text, that
ends up in the output string, and how to determine whether a PDF has one
or two columns (so that I can decide whether to sort by position or not).

Many thanks for your help.

Kind regards,
Ioannis

