On Thu, Jan 16, 2003 at 10:56:35AM +0000, martin bower wrote:

> Im writing a document management site, and am looking for pointers on how
> to index html,pdf (maybe word) docs, and then search against them.

For HTML ... strip out all the tags, index as plain text
    PDF  ... use pdf2txt, index as plain text
    Word ... use antiword, index as plain text

If you want to index headers differently from body text or something
like that, you're pretty much stuffed.  It's not possible to divine
what's a header in PDF, in Word you're unlikely to be able to extract
anything useful, and in HTML most people seem to use the header tags
to get bigger fonts and not for the purpose of marking up headers.

-- 
David Cantrell | Benevolent Dictator | http://www.cantrell.org.uk/david

  o/~ I want my SMTP o/~
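
A rough sketch of the "strip out all the tags, index as plain text" approach
described above, in Python using only the standard library. Everything named
here (TextExtractor, html_to_text, build_index, search) is made up for
illustration and is not from the original post. It assumes the PDF and Word
documents have already been turned into plain text by an external converter,
so only the HTML step is shown, and the index is a toy in-memory one rather
than anything you would run a real site on.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace to single spaces.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def build_index(docs):
    """Map each lowercased word to the set of document names containing it.

    `docs` is a dict of {name: plain_text}; the text may have come from
    html_to_text() or from an external PDF/Word converter.
    """
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Usage would be along the lines of building a dict of document name to
extracted text, passing it to build_index(), and calling search(index,
"some words"). Note that headings and body text end up in the same bag of
words, which matches the point above about not being able to treat headers
specially.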