[htdig] A solution to parse word, excel, etc.

Denis Valdenaire Wed, 28 Jan 2004 09:32:27 -0800

Hello all.

Not a question, but an answer.

I found a way to parse .doc, .pdf, .xls, and really any format that you can convert to text.

I explain it on my web site, in French, but hey, that's my native language. Translations are welcome.

The short version is this:
- use mod_rewrite (Apache) to rewrite the URL when the user-agent is htdig AND the URI ends with .doc, .xls, .pdf, etc.
- send the request to a PHP page (it could be Perl, or whatever)
- open the file named by SCRIPT_URI (the original URL) and convert it to text via pdftotext, catdoc, etc.

This way the documents are seen by htdig as text and are indexed as such. The original URL (http://somehost.com/foo.doc) is preserved.
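
For anyone who would rather not use PHP, here is a minimal Python sketch of the conversion side of the steps above. It is only an illustration, not the script from my site: the function names, the CONVERTERS table, the substring match on the user-agent, and the use of xls2csv for .xls files are all assumptions. Only pdftotext and catdoc come from the description above, and the mod_rewrite rule itself is not shown.

```python
import os
import subprocess

# Hypothetical table mapping a file extension to a text-extraction command.
# pdftotext and catdoc are named in the post; xls2csv is an assumption.
CONVERTERS = {
    ".pdf": ["pdftotext", "{path}", "-"],  # "-" makes pdftotext write to stdout
    ".doc": ["catdoc", "{path}"],
    ".xls": ["xls2csv", "{path}"],
}


def should_convert(user_agent, uri):
    """Mirror the rewrite condition: the client is htdig AND the URI
    ends with an extension we know how to convert."""
    ext = os.path.splitext(uri)[1].lower()
    return "htdig" in user_agent.lower() and ext in CONVERTERS


def converter_command(path):
    """Build the argument list for the converter matching the file extension."""
    ext = os.path.splitext(path)[1].lower()
    return [part.replace("{path}", path) for part in CONVERTERS[ext]]


def convert_to_text(path):
    """Run the converter and return its output as text.
    Requires the converter program to be installed on the server."""
    result = subprocess.run(converter_command(path), capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

The script that receives the rewritten request would map the original URL to a file on disk, call convert_to_text() on it, and send the result back as text/plain. Passing the command as a list (no shell) avoids quoting problems with odd file names.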

The good news is: you can do this with any search engine, and an external robot may also be able to parse your .doc files.

Hope it helps some of you.

I've looked into how to use external_parsers, but not only does it not work, I don't understand why.

Denis Valdenaire
<[EMAIL PROTECTED]>

