Ainhoa,
Can I ask you to check whether _new_ PDFs are getting indexed
correctly?
I notice that the syntax used in the very first, commented, line of the
external_parsers section looks different to the rest:
application/pdf->text/html /usr/local/bin/conv_doc.pl
Note the 'arrow' and mime-type bit after application/pdf. All of the
external_parsers declarations in my config have this same bit, which
makes me suspect that none of your declarations will be working just
now, though if you have not rebuilt your databases from scratch this may
not be obvious. You probably want to be using at least -vv (two letter
v's) to get verbose output from the dig process - this should tell you
what is happening during the indexing. My other thought is to check
whether the mht files are being served to you with that mime-type - this
won't work correctly if not, and you may need more than one
external_parsers declaration to cover all possibilities.
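As a rough, untested sketch of what I mean for the mht part: the
path is the one from your extract, the "->text/html" suffix assumes
that mht2html2.pl prints HTML on standard output, and message/rfc822
is only my guess at a second type your server might send for .mht
files. Your postscript and pdf entries would still need to go into
the same attribute.

```
# Sketch only: "->text/html" tells htdig to treat the program's output
# as HTML and parse it internally, so mht2html2.pl must print HTML on
# standard output.
# message/rfc822 is an assumed extra type; use whatever -vv reports.
external_parsers: application/mht->text/html /opt/vin/mht2html2.pl \
                  message/rfc822->text/html /opt/vin/mht2html2.pl
```

If I remember the documentation correctly, htdig passes the program
the name of a temporary file holding the document as its first
argument, so it is worth confirming that your modified script reads
its input that way.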
Regards,
Mike
________________________________
From: Ainhoa L [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 06, 2008 5:29 PM
To: Brockington,MJ,Michael,JPGA4X R
Cc: [email protected]
Subject: Re: [htdig] Htdig and MHT files
Hi Mike,
You are talking about the version with the mht parser, right?
Below is the extract of my config that mentions mht, and I have
attached the whole file and the parser. (Originally the parser wrote
out a separate file for each file contained in the mht; I modified it
so that it only outputs the code from the htm file.) Maybe the parser
I modified is also sending some garbage that the indexer can't read?
bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
    .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css
valid_extensions: .html .htm .shtml .php .uhtml .phtml .txt .pdf .mht
external_parsers: application/postscript /usr/local/apache/htdocs/htdig-3.1.6/contrib/parsepdf.pl \
    application/pdf /usr/local/apache/htdocs/htdig-3.1.6/contrib/parsepdf.pl \
    application/mht /opt/vin/mht2html2.pl
Thanks a lot for your help!
Regards,
Ainhoa
On Feb 5, 2008 9:58 PM, <[EMAIL PROTECTED]> wrote:
Can you show us at least an extract of your config file
- as you describe it this should work.
Regards,
Mike
-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Ainhoa L
Sent: Tue 2/5/2008 4:09 PM
To: [email protected]
Subject: [htdig] Htdig and MHT files
Hi! This may be a very stupid question, but is it possible to index
mht files with htdig?
I have tried adding .mht to the valid_extensions list, etc. Obviously
htdig doesn't treat them as HTML and refuses to index them. I looked
for a parser, found an mht2html parser, and modified it so that it
just sends the HTML to its output. I added it to the parsers in the
htdig config file. This didn't work, although the parser returns
valid HTML...
Is there any way to index mht files with htdig?
Thanks a lot for your help.
_______________________________________________
ht://Dig general mailing list: <[email protected]>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general