In my htdig.conf I was using this:
external_parsers: \
but instead I should use this:
external_parsers: \
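With the former, I believe that htdig passes four arguments to catdoc, namely
infile content-type URL configuration-file
as per http://www.htdig.org/attrs.html#external_parsers
However, catdoc expects a list of files, so it treats every argument as a file name: it tries to read infile, then the content type, then the URL, and then the configuration file. The content type and the URL are not files, hence the two "No such file or directory" errors per document. The configuration file does exist, so catdoc presumably reads it as well, which would explain why the text of htdig.conf ended up attached to every .doc file.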
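To make the failure mode concrete, here is a small Python sketch. It is not catdoc and not htdig's own code, just a model of a tool that treats every argument as a file to read, fed the four arguments described above. The names cat_all and demo, the file report.doc, the content type application/msword and the example.com URL are all made up for illustration.

```python
import os
import tempfile


def cat_all(args):
    """Model of a tool that treats every argument as a file to read.

    Returns (output_text, errors). This only imitates the behaviour
    described in the message; it is not catdoc itself.
    """
    output = []
    errors = []
    for arg in args:
        try:
            with open(arg, "r") as handle:
                output.append(handle.read())
        except OSError as exc:
            # Arguments that are not readable files produce an error each.
            errors.append(f"{arg}: {exc.strerror}")
    return "".join(output), errors


def demo():
    """Call cat_all with the four arguments an external parser receives."""
    with tempfile.TemporaryDirectory() as tmp:
        doc_path = os.path.join(tmp, "report.doc")
        conf_path = os.path.join(tmp, "htdig.conf")
        with open(doc_path, "w") as handle:
            handle.write("text of the document\n")
        with open(conf_path, "w") as handle:
            handle.write("virtual web trees or database\n")

        # infile, content-type, URL, configuration-file (hypothetical values)
        args = [
            doc_path,
            "application/msword",
            "http://www.example.com/report.doc",
            conf_path,
        ]
        return cat_all(args)


if __name__ == "__main__":
    text, problems = demo()
    print("errors:", problems)
    print("parser output:")
    print(text)
```

In this model the content type and the URL each fail to open, while the configuration file opens fine and its text is appended to the document's text, which matches the symptoms in the original report.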
Not sure where I picked up that wrong catdoc syntax. Maybe others can learn from my mistake.
Logan
At 06:42 PM 7/7/01, you wrote:
I am running htdig with parse_doc.pl (2000/01/12), pdftotext and catdoc.
When I index my site, the htdig system indeed finds and indexes my .doc files (and also my .pdf files) properly. They are "findable" by their content. Fine.
However, I notice that for some reason my htdig.conf file is ALSO indexed, and it can be found simply by searching on a keyword or two from the htdig.conf file! The hits returned for the htdig.conf file carry the titles of, and links to, the .doc files that rundig has indexed. That is, if I have two .doc files and I search on, say,
"virtual web trees or database"
(this is a phrase inside htdig.conf) then I get two hits whose titles and links point to the .doc files, but whose excerpt comes from the htdig.conf file. Not fine.
I also notice that during rundig I get two of these errors for each .doc file being indexed, e.g.
catdoc: No such file or directory
catdoc: No such file or directory
but indeed each .doc file is indexed just fine.
I am new to the htdig family, so do not think that I know what I am doing. (I am trying, I am trying.) Any help would be greatly appreciated.
Logan
_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/htdig-dev
