Geoff Hutchison writes:
>
> I'm actually very glad to hear from you. I've heard good things about
> Catalog and obviously we have many interests in common. I took a look at
Glad to hear that. There is so much to do and so little time :-)
> It would be nice to share some of the webbase code (and/or URI), and
> maybe parts of mifluz and Text::Query::SQL. There's quite a bit of
> demand for a (my)SQL backend to ht://Dig, as well as parsing
> AltaVista-style queries. However, I've been up to my eyeballs in getting
> the database format changes ready and the new Transport code.
I must confess that the first two contributions I envision are
1) Switch to automake + libtool
2) Use an SQL backend (the rough idea is to encapsulate the
backend-specific things in a shared lib module and implement a DBI-like
interface; see the sketch below).
I was pleased to see you already had that in the wish list.
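
To make 2) more concrete, here is a very rough sketch of what I have in
mind for the DBI-like interface. None of these classes or functions exist
in ht://Dig or webbase today, the names are invented for the example; each
SQL backend would be compiled as its own shared lib module behind the
abstract class:

// Illustrative sketch only: a DBI-like abstraction the indexer could
// program against.  None of these names exist in ht://Dig.

#include <string>
#include <vector>

typedef std::vector<std::string> DbRow;   // one row of a result set

class DbBackend {
public:
    virtual ~DbBackend() {}
    // dsn would be something like "mysql:host=localhost;db=htdig"
    virtual bool connect(const std::string &dsn,
                         const std::string &user,
                         const std::string &password) = 0;
    virtual bool execute(const std::string &sql) = 0;  // INSERT/UPDATE/DELETE
    virtual bool query(const std::string &sql) = 0;    // SELECT
    virtual bool fetchRow(DbRow &row) = 0;             // false when exhausted
    virtual void close() = 0;
};

// Factory resolved at run time: "mysql" would load a MySQL module,
// another driver name could map onto the existing database files.
DbBackend *createBackend(const std::string &driver);

The indexer would only ever see DbBackend, so switching from the current
database files to MySQL becomes a configuration choice instead of a code
change.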
> I have no doubt that ht://Dig can handle millions of documents. There
> are several sites in that ballpark, plus several more around 500,000+
> documents. There are obvious problems with the size of the databases
> (many OS limit files to 2GB), but this is greatly eased in the 3.2
> codebase.
Disclaimer: I may say stupid things because I didn't look at the code
carefully. It seems to me that a few factors effectively prevent a large
scale crawler from being maintained:
. The list of starting point URLs is in the configuration file.
Our search engine has 150 000 starting point URLs, which is hard to
manage in a configuration file.
. When the crawler updates URLs it does a network access for
all of them. Let's say I have 10 million URLs: this is not really
what I want. What I want is that a URL successfully fetched is
not checked again for a week (configurable). Generally speaking I
want to specify update strategies that depend on the URL status
(loaded, not modified, not found). I even want to specify a
different update strategy for every site, if appropriate (daily
for newspapers, monthly for archives, etc.); see the sketch below.
Of course this (and many other things) depends on having a real database
in the back-end, not just a hash table.
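
To give an example of what such update strategies could look like once
the URL list and its status live in the database, here is a small sketch.
The table layout, the policy values and all identifiers are invented for
the example, nothing of this exists in the current code:

// Illustrative only: deciding whether to hit the network for a URL,
// given the status of the last fetch and a per-site update policy.
//
//   CREATE TABLE url (
//     url        VARCHAR(255) PRIMARY KEY,
//     status     ENUM('loaded', 'not_modified', 'not_found'),
//     last_fetch INT,           -- unix time of the last network access
//     site       VARCHAR(128)   -- key into the policy table below
//   );

#include <ctime>
#include <map>
#include <string>

enum UrlStatus { LOADED, NOT_MODIFIED, NOT_FOUND };

struct UpdatePolicy {            // delay before refetching, per status
    time_t loaded;               // successful fetch: e.g. one week
    time_t not_modified;         // stable document: can wait longer
    time_t not_found;            // retry missing documents after a while
};

static const time_t DAY = 24 * 3600;
static std::map<std::string, UpdatePolicy> policies = {
    { "default",   {  7 * DAY, 14 * DAY, 30 * DAY } },
    { "newspaper", {  1 * DAY,  1 * DAY,  7 * DAY } },  // daily
    { "archive",   { 30 * DAY, 60 * DAY, 90 * DAY } },  // monthly or more
};

// True if the crawler should do a network access for this URL now.
bool needsRefresh(const std::string &site, UrlStatus status,
                  time_t last_fetch, time_t now)
{
    std::map<std::string, UpdatePolicy>::iterator it = policies.find(site);
    const UpdatePolicy &p =
        (it != policies.end()) ? it->second : policies["default"];
    time_t delay = (status == LOADED)       ? p.loaded
                 : (status == NOT_MODIFIED) ? p.not_modified
                                            : p.not_found;
    return now - last_fetch >= delay;
}

With something like this, the 150 000 starting points and the per-site
strategies become rows you INSERT and UPDATE, not entries in a
configuration file.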
> Fortunately, I certainly don't see ht://Dig going the way of isearch or
> freewais--it was a bit touch-and-go last year before Andrew opened up
> the CVS tree and I took over. None of us want to see that repeated. ;-)
Thank you for the information.
> I'm sure that's true. Right now, I'd prefer to work on it part-time,
> though I often accept contract jobs for improving ht://Dig. Personally,
> I'd prefer to focus on the maintainer aspects than the developer since I
> don't consider myself a very outstanding coder. That's just my current
> personal preference...
You mean you would refuse a $60 000/year proposal ?-) Assuming that the
company hires you to "continue working on ht://dig", does not assign
a project manager to you and does not set deadlines. The only thing the
company does is give you a salary, to make sure ht://dig will not suffer
when at some point you find a well paid job that eats all your
energy. I strongly believe a project like ht://dig needs at least two or
three full-time, motivated computer geeks.
--
Loic Dachary
ECILA
100 av. du Gal Leclerc
93500 Pantin - France
Tel: 33 1 56 96 09 80, Fax: 33 1 56 96 09 61
e-mail: [EMAIL PROTECTED] URL: http://www.senga.org/