On Thu, Jan 16, 2003 at 10:56:35AM +0000, martin bower wrote:
> I'm writing a document management site, and am looking for pointers on how 
> to index HTML, PDF (maybe Word) docs, and then search against them.

For HTML ... strip out all the tags, index as plain text
PDF ... use pdf2txt, index as plain text
Word ... use antiword, index as plain text
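
As a rough illustration of the "strip the tags, index as plain text"
approach, here is a minimal Python sketch using only the standard library.
The names TextExtractor, html_to_text, build_index and search are made up
for this example, and the index is a toy in-memory word-to-documents map,
not a real search engine. Text from PDF and Word files is assumed to have
been produced already by an external converter and can be fed into the same
index as plain strings.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML document with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Chunks are joined with a space so adjacent blocks do not run together;
    # the trade-off is that inline markup inside a word would split it.
    return " ".join(" ".join(parser.parts).split())


def build_index(docs):
    """Map each lowercased word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents containing every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

Usage would be along the lines of building a dict of document name to
extracted text, passing it to build_index, and then calling search with the
user's query. Anything beyond that (stemming, ranking, phrase matching) is
left out of this sketch.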

If you want to index headers differently from body text or something like
that, you're pretty much stuffed.

It's not possible to divine what's a header in PDF, in Word you're
unlikely to be able to extract anything useful, and in HTML most people
seem to use the header tags to get bigger fonts and not for the purpose
of marking up headers.

-- 
David Cantrell | Benevolent Dictator | http://www.cantrell.org.uk/david

  o/~ I want my SMTP o/~
