Hi all,

I wonder whether it is feasible to tell whether an input PDF is a proper PDF (containing text) or a PDF produced by scanner software (containing only images). In the second case I could then run OCR software such as tesseract/tess4j. Looking at Metadata.CONTENT_TYPE doesn't help, as it is always "application/pdf".
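
To make the first question concrete, this is roughly the check I have in mind: a minimal sketch in plain Java, assuming the text has already been extracted (e.g. by the parser) and the page count is known. The class name, the method name and the threshold of 20 characters per page are all made up for illustration and would need tuning. A scanned PDF that already carries an embedded OCR text layer would still be classified as a text PDF by this check.

```java
// Hypothetical helper for illustration only; not part of any library.
final class ScannedPdfHeuristic {

    // Assumed threshold: fewer non-whitespace characters per page than this
    // is treated as "no real text layer". The value is a guess to be tuned.
    static final int MIN_CHARS_PER_PAGE = 20;

    // Returns true when the extracted text is so sparse that the PDF
    // probably contains only page images and should be sent to OCR.
    static boolean looksScanned(String extractedText, int pageCount) {
        if (pageCount <= 0) {
            throw new IllegalArgumentException("pageCount must be positive");
        }
        long nonWhitespace = 0;
        if (extractedText != null) {
            nonWhitespace = extractedText.chars()
                    .filter(c -> !Character.isWhitespace(c))
                    .count();
        }
        return nonWhitespace < (long) MIN_CHARS_PER_PAGE * pageCount;
    }

    public static void main(String[] args) {
        // Empty extraction result from a three-page document.
        System.out.println(looksScanned("", 3));
        // A single page with an ordinary sentence of selectable text.
        System.out.println(looksScanned(
                "This page has a normal amount of selectable text on it.", 1));
    }
}
```

If something like this is the only option, I would call it on the parser's output string and only fall back to OCR when it returns true.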

Moreover, I wonder how I could remove header/footer text, or even table text, that ends up in the output string, and how to determine whether a PDF has one or two columns (so that I can decide whether to sort by position or not).

Many thanks for your help.

Kind regards,
Ioannis