I use iText for everything I can. For this specific case, I use PDFBox to 
extract the text from the first few pages (I first check how many pages are 
in the PDF), and if the number of words exceeds a preset threshold, I assume 
the PDF is text-indexable.

It's not foolproof, but it's part of my OCR solution: if the PDF has fewer 
than the threshold number of words, I send it for OCR. That makes the check 
an optimization more than anything. If the PDF really is text-based and the 
first page or two happen to be a cover page with very few words by design, 
it doesn't hurt to send it for OCR anyway; it just takes a little longer.
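
A minimal sketch of the check described above, not the exact code in use. 
To keep it free of library dependencies, it takes the already-extracted text 
as a String. With PDFBox, that text could come from PDFTextStripper (using 
setStartPage, setEndPage and getText) after reading the page count with 
PDDocument.getNumberOfPages(). The class name, the method names, the 
whitespace-based word count and the threshold of 20 are all assumptions made 
for illustration, as is treating a count exactly equal to the threshold as 
text-based.

```java
class TextLayerCheck {

    /** Last page (1-based) to sample: the first maxPages pages, or fewer if the PDF is shorter. */
    static int lastPageToSample(int pageCount, int maxPages) {
        return Math.max(0, Math.min(pageCount, maxPages));
    }

    /** Counts whitespace-separated tokens in the extracted text (a rough word count). */
    static int countWords(String text) {
        if (text == null) {
            return 0;
        }
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** True when the sample has fewer words than the threshold, so the PDF should go to OCR. */
    static boolean needsOcr(String sampledText, int minWords) {
        return countWords(sampledText) < minWords;
    }

    public static void main(String[] args) {
        int minWords = 20; // hypothetical threshold; tune it for your documents

        // A scanned (image-only) PDF typically yields little or no extracted text.
        String scannedSample = "  \n ";
        System.out.println(needsOcr(scannedSample, minWords)); // prints "true"

        // A 10-page PDF sampled with maxPages = 3 reads pages 1 through 3.
        System.out.println(lastPageToSample(10, 3)); // prints "3"
    }
}
```

A short cover page falls below the threshold and is sent for OCR anyway, 
which matches the trade-off above: a false "needs OCR" only costs time.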

-AJ

----- Original Message ----- 
From: "Bernhard Haslinger" <[email protected]>
To: <[email protected]>
Sent: Tuesday, July 19, 2011 8:58 AM
Subject: [iText-questions] How to check if a PDF is OCR recognized


> Dear all,
>
> I have a lot of PDF files. Some of them are bitmaps, and some of them are 
> OCR-recognized.
> Now I plan to have all of the files OCR-recognized, but I don't want to 
> scan all of the documents if I can avoid it, because I think most of them 
> are already recognized.
>
> Is there a way to check with the iText library whether an existing PDF has 
> an OCR layer or not?
>
> Please let me know :-)
> Maybe there is another possibility (other than iText) to solve my problem?
>
> Thanks in advance
> bernhard
>
> --
> View this message in context: 
> http://itext-general.2136553.n4.nabble.com/How-to-check-if-a-PDF-is-OCR-recognized-tp3678057p3678057.html
> Sent from the iText - General mailing list archive at Nabble.com.
>
> ------------------------------------------------------------------------------
> Magic Quadrant for Content-Aware Data Loss Prevention
> Research study explores the data loss prevention market. Includes in-depth
> analysis on the changes within the DLP market, and the criteria used to
> evaluate the strengths and weaknesses of these DLP solutions.
> http://www.accelacomm.com/jaw/sfnl/114/51385063/
> _______________________________________________
> iText-questions mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/itext-questions
>
> iText(R) is a registered trademark of 1T3XT BVBA.
> Many questions posted to this list can (and will) be answered with a 
> reference to the iText book: http://www.itextpdf.com/book/
> Please check the keywords list before you ask for examples: 
> http://itextpdf.com/themes/keywords.php
> 

