In-Reply-To: <[EMAIL PROTECTED]>
> Date: Sat, 18 Jan 2003 21:09:53 +0000
> To: [EMAIL PROTECTED]
> From: Robert Isaac <[EMAIL PROTECTED]>
> Subject: [htdig] pdf files
> 
> This may sound like a silly question, but if PDF files need to be indexed
> with htdig through an external parser, do the documents need to be
> scanned as text, or can they still be read if scanned as images? The
> reason I ask is that many of the documents I want to convert to PDF
> have poor paper and type. Thanks

There need to be words in there for any indexing to work.

The project to write an indexer for images is up there
with the (apocryphal?) story of Marvin Minsky, at the
height of 1960s Artificial Intelligence hubris, giving
a grad student a summer project - "vision".

So your best bet is to do Optical Character Recognition
and post the docs as HTML, perhaps with a GIF of the
scanned page included as insurance against recognition
errors.
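
To make the "OCR text plus GIF" idea concrete, here is a rough
sketch in Python. It assumes the OCR step has already been done
with whatever tool you have and that you are holding its plain-text
output; the function name, the title, and the file names are made
up for illustration. It only wraps the text in a minimal HTML page
(which htdig can index without an external parser) and embeds the
scan beside it.

```python
import html


def ocr_page_to_html(title, ocr_text, scan_image):
    """Wrap OCR output in a minimal HTML page that embeds the original scan.

    Hypothetical helper: ocr_text is assumed to be plain text already
    produced by an OCR program; scan_image is the file name of the scan.
    """
    # Treat blank lines as paragraph breaks and drop empty chunks.
    paragraphs = [p.strip() for p in ocr_text.split("\n\n") if p.strip()]
    # Collapse line wraps inside a paragraph and escape '<', '>' and '&'
    # so stray characters from the OCR cannot break the markup.
    body = "\n".join(
        "<p>%s</p>" % html.escape(" ".join(p.split())) for p in paragraphs
    )
    return (
        "<html>\n<head><title>%s</title></head>\n<body>\n%s\n"
        '<p><img src="%s" alt="Scan of the original page"></p>\n'
        "</body>\n</html>\n"
    ) % (html.escape(title), body, html.escape(scan_image, quote=True))
```

The indexer then sees the words, and a human reader can check
them against the picture whenever the OCR has guessed wrong.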

If they're that bad, it may be quicker to retype, since
the world's best OCR program resides just behind your 
eyes. 

Mike




_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
