I use iText for everything I can. For this specific case, I use PDFBox to extract the text from the first few pages (after first checking how many pages the PDF has), and if the number of words exceeds a preset threshold, I assume the PDF is text-indexable.
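For illustration, the decision step of that heuristic could look roughly like the sketch below. This is not my actual code: the class name, method names, and the `pageText` callback are all made up for the example, and text extraction is deliberately left behind that callback so the block compiles with no PDF library on the classpath. The cutoff here treats "at least `minWords`" as text-based; that choice is an assumption.

```java
import java.util.function.IntFunction;

/** Decision step of the word-count heuristic. All names here are illustrative. */
final class TextLayerCheck {

    /** Counts whitespace-separated tokens in a chunk of extracted text. */
    static int countWords(String text) {
        if (text == null) {
            return 0;
        }
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /**
     * Returns true if the first pagesToCheck pages together hold at least
     * minWords words. pageText maps a 1-based page number to the text
     * extracted from that page (hypothetical hook for the PDF library).
     */
    static boolean looksTextBased(int pageCount, int pagesToCheck, int minWords,
                                  IntFunction<String> pageText) {
        // Never read past the end of a short document.
        int lastPage = Math.min(pageCount, pagesToCheck);
        int words = 0;
        for (int page = 1; page <= lastPage; page++) {
            words += countWords(pageText.apply(page));
            if (words >= minWords) {
                return true;  // enough text found, no OCR needed
            }
        }
        return false;  // too little text, send the file for OCR
    }
}
```

With PDFBox, `pageCount` would come from `PDDocument.getNumberOfPages()`, and `pageText` could wrap a `PDFTextStripper` with `setStartPage` and `setEndPage` both set to the requested page before calling `getText(document)`. The page count and word threshold are tuning knobs; pick values that suit your documents.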
It's not foolproof, but it's part of my OCR solution: if the PDF has fewer words than the threshold, I send it for OCR. So it's an optimization more than anything. If the PDF really is text-based and the first page or two happens to be a cover page or something else with very few words by design, it doesn't hurt to send it for OCR anyway; it just takes a little longer.

-AJ

----- Original Message -----
From: "Bernhard Haslinger" <[email protected]>
To: <[email protected]>
Sent: Tuesday, July 19, 2011 8:58 AM
Subject: [iText-questions] How to check if a PDF is OCR recognized

> Dear all,
>
> I have a lot of PDF files; some of them are bitmaps, some of them are
> OCR recognized.
> Now I plan to have all of the files OCR recognized, but I don't want to
> scan all documents if I can avoid it, because I think most of them are
> already recognized.
>
> Is there a way to check with the iText library whether an existing PDF
> has an OCR layer or not?
>
> Please let me know :-)
> Maybe there is another possibility (other than iText) to solve my problem?
>
> Thanks in advance
> bernhard
>
> --
> View this message in context:
> http://itext-general.2136553.n4.nabble.com/How-to-check-if-a-PDF-is-OCR-recognized-tp3678057p3678057.html
> Sent from the iText - General mailing list archive at Nabble.com.
_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions

iText(R) is a registered trademark of 1T3XT BVBA.
Many questions posted to this list can (and will) be answered with a reference to the iText book: http://www.itextpdf.com/book/
Please check the keywords list before you ask for examples: http://itextpdf.com/themes/keywords.php
