Hi all,

I wonder whether it is possible to tell whether an input PDF is a proper 
PDF (containing a text layer) or a PDF produced by scanner software (containing 
only images). In the second case I could run OCR software such as 
tesseract/tess4j. Looking at Metadata.CONTENT_TYPE doesn't help, as it 
is always "application/pdf".
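
To make the question concrete, here is a rough sketch of the kind of check I 
have in mind. It assumes the text and the page count have already been obtained 
from the parser by some means. The class name, the method name and the 
threshold of 20 characters per page are all made up for illustration and are 
not part of any library. A scanned PDF with an embedded OCR text layer would 
still pass as a "proper" PDF under this heuristic.

```java
class ScannedPdfHeuristic {
    // Hypothetical threshold: fewer visible characters per page than this
    // suggests that the PDF has no real text layer.
    static final int MIN_CHARS_PER_PAGE = 20;

    // Returns true when the extracted text is too sparse for the page count,
    // i.e. the document is probably image-only and a candidate for OCR.
    static boolean looksScanned(String extractedText, int pageCount) {
        if (pageCount <= 0) {
            throw new IllegalArgumentException("pageCount must be positive");
        }
        long visible = extractedText == null
                ? 0
                : extractedText.chars()
                        .filter(c -> !Character.isWhitespace(c))
                        .count();
        return visible < (long) MIN_CHARS_PER_PAGE * pageCount;
    }

    public static void main(String[] args) {
        // Whitespace only across three pages: likely a scan.
        System.out.println(looksScanned("  \n\n ", 3));
        // A single page with an ordinary sentence: likely a proper PDF.
        System.out.println(looksScanned(
                "This page has a real text layer with plenty of characters.", 1));
    }
}
```

If something like this is reasonable, only the documents flagged by 
looksScanned would be sent to OCR.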

Moreover, I wonder how I could remove header/footer text, or even table text, 
that ends up in the output string, and how to tell whether a PDF has one or 
two columns (so that I can decide whether to sort by position or not).

Many thanks for your help.

Kind regards,
Ioannis
