On Tue, 29 Jan 2002, Neal Richter wrote:

> defined.. virtual void functions, etc.  The current Retriever is highly
> built around the idea of webpages and HTTP (of course)...

The Retriever class isn't really built around much of anything IMHO. It
requires that documents have a URL and that the URLs can be grouped into
Server objects. 

In the 3.1 code, the Document class was bound to webpages and HTTP. In the
3.2 code, the Document class is more of a "branch point," picking
Transport and Parser objects as needed.
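
To make the branch-point idea concrete, here's a rough sketch in C++.
The names and signatures are illustrative only -- Transport::Get(),
Parser::Get() and this parse() signature are my invention for the
example, not the actual 3.2 interfaces:

    // Sketch of Document as a "branch point". Names are hypothetical;
    // the real 3.2 classes differ in detail.
    #include <string>

    class Retriever;   // the spider itself, supplied by the caller

    class Transport {
    public:
        virtual ~Transport() {}
        virtual int Fetch(const std::string &url, std::string &body,
                          std::string &contentType) = 0;
        static Transport *Get(const std::string &url);  // pick by scheme
    };

    class Parser {
    public:
        virtual ~Parser() {}
        virtual void parse(Retriever &retriever, const std::string &baseUrl,
                           const std::string &body) = 0;
        static Parser *Get(const std::string &contentType);
    };

    class Document {
    public:
        // Branch once on the URL scheme to fetch, then once on the
        // content type to parse. Document itself knows nothing of HTTP.
        int RetrieveAndParse(Retriever &retriever, const std::string &url) {
            std::string body, type;
            Transport *t = Transport::Get(url);
            if (!t || t->Fetch(url, body, type) != 0)
                return -1;
            if (Parser *p = Parser::Get(type))
                p->parse(retriever, url, body);
            return 0;
        }
    };

Everything scheme- or format-specific lives behind those two factory
calls; Document stays generic.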

> you could write a retriever to get docs directly out of a database,
> from a file, scp, POP, via parameters, etc.

The distinction I drew when working on indexing toward the beginning of
the 3.2 code was between the Retriever (the spider itself) and the
Transport. The latter handles a specific URL scheme, so there are now
HtHTTP, HtFile, HtNNTP and External transport classes. It sounds like
you're talking more about Transport-type concepts.
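
To illustrate how little a scheme-specific transport needs to know,
here's a hypothetical file: transport built on the Transport sketch
above. The real HtFile class has a different interface and does more:

    // Hypothetical file: transport; a sketch only.
    #include <fstream>
    #include <sstream>

    class FileTransport : public Transport {
    public:
        int Fetch(const std::string &url, std::string &body,
                  std::string &contentType) {
            // Strip the "file://" prefix and read the local file.
            static const std::string scheme = "file://";
            if (url.compare(0, scheme.size(), scheme) != 0)
                return -1;
            std::ifstream in(url.substr(scheme.size()).c_str());
            if (!in)
                return -1;
            std::ostringstream buf;
            buf << in.rdbuf();
            body = buf.str();
            contentType = "text/plain";  // a real transport would detect this
            return 0;
        }
    };

    // Transport::Get() would then dispatch on the scheme, roughly:
    //   url starts with "http://"  -> the HtHTTP instance
    //   url starts with "file://"  -> the FileTransport instance
    //   ...

A database, scp or POP source would slot in the same way: one class per
scheme, nothing else in the indexer needs to change.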

Again, I think it's the URL that's the critical point. Otherwise, how are
the search results useful? How do you "jump to" a particular result from
the output? The databases tie each URL to a DocID, which is used
internally but doesn't seem particularly useful to the outside
world. Maybe I'm misunderstanding you.
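
To spell out what I mean: the URL is the stable external handle, and the
DocID is just a compact internal key. A much-simplified sketch of that
tie (the real document database keeps this on disk, with a different
record layout):

    // Simplified sketch of the URL <-> DocID tie, using in-memory maps.
    #include <map>
    #include <string>

    struct DocumentRef {
        int docID;            // compact key used by the word index
        std::string url;      // what search results hand back to the user
        std::string title;
    };

    class DocumentDB {
        std::map<int, DocumentRef> byID;
        std::map<std::string, int> byURL;
        int nextID;
    public:
        DocumentDB() : nextID(1) {}

        int Add(const std::string &url, const std::string &title) {
            std::map<std::string, int>::iterator it = byURL.find(url);
            if (it != byURL.end())
                return it->second;           // already indexed
            DocumentRef ref;
            ref.docID = nextID++;
            ref.url = url;
            ref.title = title;
            byID[ref.docID] = ref;
            byURL[url] = ref.docID;
            return ref.docID;
        }

        // Search hits come back as DocIDs; this turns them into
        // something a user can actually jump to.
        const DocumentRef *Lookup(int docID) const {
            std::map<int, DocumentRef>::const_iterator it = byID.find(docID);
            return it == byID.end() ? 0 : &it->second;
        }
    };

Without a URL in that record, a hit is just a number.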

> unix-style mail files, 
> XML files (given a spec.. see XSLT), 
> ...
> other document formats..

Certainly all of these could be handled within the current Parser class
hierarchy, perhaps with some additional revision. The current trend has
been to cut out Parser classes in favor of the external parsers and
external converters. Either approach can work *if* the parsers are
maintainable. (The previous PostScript and PDF parsers weren't.)

If the HTML parser seems somehow "special," it's simply because the
other remaining parser classes are much simpler. There's very little left
in the 3.2 code that cares whether a document is HTML or XML or whatever.
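
To show just how simple, here's a hypothetical plain-text parser in
full. The got_word() callback is a stand-in for whatever the spider
actually exposes; the real Retriever interface differs:

    // Hypothetical plain-text parser: every maximal alphanumeric run
    // becomes one word reported to the spider.
    #include <cctype>
    #include <string>

    class Retriever {   // stand-in for the real spider interface
    public:
        virtual ~Retriever() {}
        virtual void got_word(const std::string &word, int position) = 0;
    };

    class PlainTextParser {
    public:
        void parse(Retriever &retriever, const std::string &body) {
            std::string word;
            int position = 0;
            // Walk one past the end so the final word gets flushed.
            for (std::string::size_type i = 0; i <= body.size(); ++i) {
                bool alnum = i < body.size() &&
                             std::isalnum((unsigned char) body[i]);
                if (alnum) {
                    word += body[i];
                } else if (!word.empty()) {
                    retriever.got_word(word, position++);
                    word.erase();
                }
            }
        }
    };

A mail-file or XML parser would mostly just add structure-awareness on
top of this same word-feeding loop.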

>       Again, as I write and test this stuff I'll forward .tgz files with
> a script to do the setup and diff-ing.  Feel free to use it or pipe it to
> /dev/null if all you want is a web-crawling search engine. ;-)

Your work is appreciated. I'm just trying to point out a few things as
someone who's been around for a while.

1) We've been moving in this direction with 3.2, and for most purposes,
IMHO, it's already there. If you have other suggestions, certainly feel
free to contribute.

2) It's better not to reinvent the wheel; generally, the less code that
needs to be maintained, the better. Do we really need new Retriever
classes, or do we need to refactor what we have?

3) There are differing philosophies on the Parser class: what should be
internal ht://Dig code and what should be plugged in through the external
parsers and converters.
(As for #3, I'm personally all for new Parser subclasses, as long as
they don't become headaches the way the old PDF.cc did.)

-Geoff

