Phew! I need to type faster. As I wrote a response to Tom, Gilles sent this!
At 12:00 PM -0600 11/22/99, Gilles Detillieux wrote:
>Now I'm a bit fuzzy on the history, because all of this happened over
>a year ago when I came on the scene, but I believe that external parser
I came on as maintainer about 18 months ago. I *do* know that I added
the PDF.cc parser, which was contributed about the same time I came
on. I just looked things up using CVSweb:
http://dev.htdig.org/cgi-bin/cvsweb.cgi/htdig3/htdig/ExternalParser.cc
http://dev.htdig.org/cgi-bin/cvsweb.cgi/htdig3/htdig/PDF.cc
These indicate that the ExternalParser class dates to 1997.
So why did I add Sylvain's PDF parser when there was external_parser
support? It worked, and as Gilles said, it's often difficult to write
a complete external parser! Since it was provided, I thought we
should go with it. At the time, it seemed like a good idea. Hindsight
is 20/20...
>Yes, that's correct. Most document types other than HTML are
>currently dealt with just as plain text, ultimately, so all structural
>information is lost. The only exception to this is the latest version
>of parse_doc.pl, which has a hook in it to extract the title from PDFs
For example, there's a PDF library in Perl that supposedly lets you
grab various meta-information. However, no one has written an
external_parser that uses it. Even if someone did, I don't know how
useful it would be, since in general such meta-information is only
sparsely used.
>Yes, if someone could add a good, efficient and reliable XML parser to
>htdig, that would certainly be the way to go.
Such parsers exist. Hopefully someone will have a good idea about
how to use one!
>Yeah, but I think we'd need to get to the bottom of why exactly htdig
>is too slow. I don't think the current HTML parser is necessarily the
>model of efficiency either, so it may be that a well designed XML parser
>in its place wouldn't slow things down too much. I think attention really
>needs to be paid to the database back-end, and minimizing the amount of
>copying of huge strings that takes place in the current code.
We may want to start splitting into different threads. When we talk
about speed, we should be careful about what component we're talking
about. First off, Tom, where did you hear it wasn't fast enough?
As far as the indexer, I'd guess the main slowdown comes in database
operations. String optimizations wouldn't hurt, but database lookups
kill us, especially on large databases. Still, careful profiling and
optimization of 3.2 remain to be done.
> > Can htdig's config parsing handle multiple directives with the same
> > name? (If I recall correctly it only remembers the last one seen.) I
> > was just thinking that it might be cleaner to specify items like this
> > using multiple directives like this:
> >
> > external_parser: text/html /usr/local/bin/htmlparser
> > external_parser: application/ms-word "mswordparser -w"
>The problem is you want to be able to override attributes that were
>defined previously, in an include file for example. I'd favour a
>different syntax for appending to an already defined attribute, e.g.:
>
> bad_extensions += .pdf
Right. Plus this sort of change will be a bit easier to do now that
the guts of the config parser have been rewritten by Vadim in bison/flex.
-Geoff
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED]
You'll receive a message confirming the unsubscription.