In-Reply-To: <[EMAIL PROTECTED]>

> Date: Sat, 18 Jan 2003 21:09:53 +0000
> To: [EMAIL PROTECTED]
> From: Robert Isaac <[EMAIL PROTECTED]>
> Subject: [htdig] pdf files
>
> This may sound a silly question, but if pdf files need to be indexed
> with htdig with an external parser, does the text in the files to be
> pdf's need to be scanned as text, or can they still be read if scanned
> as image. The reason I ask is that many of the documents I want to pdf
> have poor paper and type. Thanks

There needs to be actual text in the file for any indexing to work. A PDF made from page images holds only pictures of words, so an external parser has nothing to extract. The project to write an indexer for images is up there with the (apocryphal?) story of Marvin Minsky, at the height of 1960s Artificial Intelligence hubris, giving a grad student a summer project: "vision".

So your best bet is to run Optical Character Recognition and post the docs as HTML, perhaps with a GIF of the scan included as insurance against OCR errors. If they're that bad, it may be quicker to retype, since the world's best OCR program resides just behind your eyes.

Mike
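
A quick way to see whether a given PDF has any words in it at all is to run it through a text extractor and count what comes back. The sketch below is only an illustration: it assumes the `pdftotext` command-line tool is installed and on the PATH, and the function names and the 20-word threshold are arbitrary choices made for this example, not part of htdig.

```python
import re
import subprocess
import sys


def extract_text(pdf_path: str) -> str:
    """Return whatever text pdftotext can pull out of the PDF.

    Assumes the pdftotext command-line tool is installed.
    """
    # "-" asks pdftotext to write to standard output instead of a file.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def count_words(text: str) -> int:
    """Count runs of two or more ASCII letters, a rough stand-in for words."""
    return len(re.findall(r"[A-Za-z]{2,}", text))


def has_indexable_text(text: str, min_words: int = 20) -> bool:
    """Guess whether the extracted text is worth indexing.

    The threshold is hypothetical; an image-only scan typically yields
    nothing but whitespace and form feeds.
    """
    return count_words(text) >= min_words


if __name__ == "__main__":
    for path in sys.argv[1:]:
        text = extract_text(path)
        verdict = "has text" if has_indexable_text(text) else "probably image-only"
        print(f"{path}: {count_words(text)} words, {verdict}")
```

Saved under a name of your choosing and run with one or more PDF paths as arguments, it prints a word count per file. Anything reported as probably image-only would need OCR or retyping before an indexer could find anything in it.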

