Excuse me - I found the problem!

In the htdig.conf I was using this:

external_parsers:  \
        application/pdf /opt/home3/helpers/parse_doc/parse_doc.pl \
        application/msword->text/plain "/opt/home3/helpers/catdoc/bin/catdoc -w"

but instead I should use this:

external_parsers:  \
        application/pdf /opt/home3/helpers/parse_doc/parse_doc.pl \
        application/msword /opt/home3/helpers/parse_doc/parse_doc.pl

In the former, I believe that htdig passes four arguments to catdoc, namely
infile content-type URL configuration-file
as per http://www.htdig.org/attrs.html#external_parsers
However, catdoc expects a list of files, so it processes the infile, then tries to open the content type, then
the URL, and finally the configuration file. The content type and the URL are not files, which would explain the two "No such file or directory" errors per document, and the configuration file does exist, which would explain why its text ended up in the index.
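
To illustrate the reasoning above, here is a small Python sketch that mimics a tool which, like catdoc, treats every argument as an input file. It is not catdoc or htdig code; the function name, the sample file contents, and the example.com URL are made up for the demonstration. It only assumes the four-argument order given in the htdig documentation.

```python
import os
import tempfile


def read_all_arguments_as_files(args):
    """Mimic a converter that treats every argument as an input file.

    Returns (contents, errors): the text of each argument that could be
    opened, and one error message for each argument that could not.
    """
    contents, errors = [], []
    for arg in args:
        try:
            with open(arg, encoding="utf-8") as fh:
                contents.append(fh.read())
        except OSError as exc:
            errors.append(f"{arg}: {exc.strerror}")
    return contents, errors


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-ins for the temporary infile and for htdig.conf.
    infile = os.path.join(tmp, "sample.doc")
    config = os.path.join(tmp, "htdig.conf")
    with open(infile, "w", encoding="utf-8") as fh:
        fh.write("document text")
    with open(config, "w", encoding="utf-8") as fh:
        fh.write("config text")

    # Argument order as documented: infile content-type URL configuration-file
    args = [infile, "application/msword",
            "http://www.example.com/sample.doc", config]
    contents, errors = read_all_arguments_as_files(args)

print(len(errors), "arguments could not be opened")
print(contents)
```

In this sketch the content type and the URL fail to open, while both the document and the configuration file are read, which matches the symptoms: two errors per .doc file, plus the htdig.conf text attached to each document.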

Not sure where I picked up that wrong catdoc syntax. Maybe others can learn from my mistake.

Logan



At 06:42 PM 7/7/01, you wrote:
I am running htdig with parse_doc.pl (2000/01/12), pdftotext and catdoc.

When I index my site, the htdig system indeed finds and indexes my .doc files (and also my .pdf files) properly. They are "findable" by their content. Fine.

However, I notice that for some reason my htdig.conf file is ALSO indexed, and it can be found simply by searching on a keyword or two from the htdig.conf file! The hits returned for the htdig.conf file carry the titles and links of the .doc files rundig has indexed. That is, if I have two .doc files and I search on, say,

"virtual web trees or database"

(this is a phrase inside of htdig.conf) then I get two returned hits, whose titles & links are to the .doc files, but whose excerpt is the htdig.conf file. Not fine.

I notice that during rundig, I get two of these errors for each .doc file to be indexed, e.g.

catdoc: No such file or directory
catdoc: No such file or directory

but indeed each .doc file is indexed just fine.

I am new to the htdig family, so do not think that I know what I am doing. (I am trying, I am trying.) Any help would be greatly appreciated.

Logan

_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/htdig-dev
