Jackrabbit's PDF text extractor uses PDFBox. If Adobe Reader can search the text, then PDFBox should be able to extract it, but that is only my opinion.
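One way to test that, as a rough sketch only: run the file through PDFBox directly, outside Jackrabbit, and look at how much real text comes back. The class name `TextLayerCheck`, the helper `hasUsableText`, and the threshold of 20 characters are made up for illustration. The PDFBox calls shown in the comment are the PDFBox 2.x ones and differ in other versions, so they are left as a comment and the block compiles without PDFBox.

```java
class TextLayerCheck {

    // Hypothetical helper: true when the extracted text holds at least
    // minChars letters or digits, i.e. more than whitespace or stray marks.
    static boolean hasUsableText(String extracted, int minChars) {
        if (extracted == null) {
            return false;
        }
        long count = extracted.codePoints()
                .filter(Character::isLetterOrDigit)
                .count();
        return count >= minChars;
    }

    public static void main(String[] args) {
        // Stand-in input taken from the command line. With PDFBox 2.x on the
        // classpath, the string would instead come from something like:
        //   try (PDDocument doc = PDDocument.load(new File("scan.pdf"))) {
        //       extracted = new PDFTextStripper().getText(doc);
        //   }
        String extracted = args.length > 0 ? args[0] : "";

        // The threshold of 20 is an arbitrary assumption, not a PDFBox setting.
        if (hasUsableText(extracted, 20)) {
            System.out.println("PDFBox sees a text layer; the problem is probably elsewhere.");
        } else {
            System.out.println("No usable text came back; an OCR extractor is probably needed.");
        }
    }
}
```

If PDFBox returns real text for the scanned file, the indexing configuration is the next thing to look at. If it returns nothing, the text layer is not reachable this way.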
On Mon, Jan 26, 2009 at 5:47 PM, Péterfi Balázs <[email protected]> wrote:
> I think it has already been OCRed because, as I wrote, I can search in the
> PDF with Adobe Reader and it also selects the result. But what I see is a
> scanned page, and I guess there is a text layer "behind" it. Is that possible?
>
> Paco Avila wrote:
>>
>> You can write a text extractor that performs OCR.
>>
>> On Mon, Jan 26, 2009 at 5:25 PM, Péterfi Balázs <[email protected]>
>> wrote:
>>
>>> Hello,
>>>
>>> I'm developing an application that uses Jackrabbit and have a problem
>>> with searching in PDF files. When I search in a PDF that was generated
>>> from a Word document, it works. When I try to search in a PDF that
>>> contains a scanned document, I can search through its contents from
>>> within Adobe Reader (some sort of optical character recognition), but
>>> my application does not get any results. I don't know how this kind of
>>> PDF works, but I need to search in it. Does Jackrabbit support it?
>>>
>>> Thank you!
>>> Balazs

--
Paco Avila
GIT Consultors
tel: +34 971 498310
fax: +34 971496189
e-mail: [email protected]
http://www.git.es
