Ainhoa,
Can I ask you to check whether _new_ PDFs are getting indexed
correctly?
 
I notice that the syntax used in the very first, commented, line of the
external_parsers section looks different to the rest:
application/pdf->text/html /usr/local/bin/conv_doc.pl

Note the 'arrow' and MIME type (->text/html) after application/pdf. All
of the external_parsers declarations in my config have this same suffix,
which makes me suspect that none of your declarations are working just
now, though this may not be obvious if you have not rebuilt your
databases from scratch. You probably want to run the dig with at least
-vv (two letter v's) to get verbose output; this should tell you what is
happening during indexing. My other thought is to check whether the mht
files are actually being served to you with the MIME type you have
declared (application/mht). The declaration will not match if they are
not, and you may need more than one external_parsers declaration to
cover all possibilities.
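
As a rough sketch of what I mean (untested against your setup): the
->text/html suffix tells htdig that the program is a converter whose
output should then be parsed as HTML. The conv_doc.pl line is the one
quoted above; the mht lines assume your /opt/vin/mht2html2.pl writes
plain HTML to standard output, and message/rfc822 is only my guess at
another type your server might report for .mht files:

```
# Sketch only. "type1->type2" marks the program as a converter
# that emits type2 (here HTML) for htdig to parse.
# message/rfc822 is an assumed alternative MIME type for .mht files;
# replace it with whatever Content-Type your server really sends.
external_parsers: application/pdf->text/html /usr/local/bin/conv_doc.pl \
                  application/mht->text/html /opt/vin/mht2html2.pl \
                  message/rfc822->text/html /opt/vin/mht2html2.pl
```

A run with -vv should then show which type each document arrives with
and whether the converter is being called for it.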

Regards,
Mike


________________________________

        From: Ainhoa L [mailto:[EMAIL PROTECTED] 
        Sent: Wednesday, February 06, 2008 5:29 PM
        To: Brockington,MJ,Michael,JPGA4X R
        Cc: [email protected]
        Subject: Re: [htdig] Htdig and MHT files
        
        
        Hi Mike,
         
        You are talking about the version with the mht parser, right?
        Below is an extract of the parts that mention mht, and I attach
        the whole file and the parser. (Originally the parser would
        create a file for each file contained in the mht; I modified it
        so that it only outputs the code from the htm file.) Maybe the
        parser I modified is sending some other garbage that the indexer
        can't read?
        
         
        bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
        .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css
         
        valid_extensions: .html .htm .shtml .php .uhtml .phtml .txt .pdf .mht
         
        external_parsers: application/postscript /usr/local/apache/htdocs/htdig-3.1.6/contrib/parsepdf.pl\
        application/pdf /usr/local/apache/htdocs/htdig-3.1.6/contrib/parsepdf.pl \
        application/mht /opt/vin/mht2html2.pl
         
        Thanks a lot for your help!
        Regards,
         
        Ainhoa


         
        On Feb 5, 2008 9:58 PM, <[EMAIL PROTECTED]> wrote:
        

                Can you show us at least an extract of your config file
                - as you describe it this should work.
                
                Regards,
                Mike
                


                -----Original Message-----
                From: [EMAIL PROTECTED] on behalf of Ainhoa L
                Sent: Tue 2/5/2008 4:09 PM
                To: [email protected]
                Subject: [htdig] Htdig and MHT files
                
                Hi! Maybe this is a very stupid question, but is it
                possible to index mht files with htdig?
                I have tried adding .mht to the valid_extensions list,
                etc. Obviously htdig doesn't treat them as HTML and
                refuses to index them. I looked for a parser and found
                an mht2html parser, which I modified so that it just
                writes the HTML to its output. I added it to the parsers
                in the htdig config file. This didn't work, although the
                parser returns valid HTML...
                I would like to know if there is any way to index mht
                files with htdig.
                Thanks a lot for your help.
                
                


