Geoff Hutchison writes:
 > 
 > I'm actually very glad to hear from you. I've heard good things about
 > Catalog and obviously we have many interests in common. I took a look at

 Glad to hear that. There is so much to do and so little time :-)

 > It would be nice to share some of the webbase code (and/or URI), and
 > maybe parts of mifluz and Text::Query::SQL. There's quite a bit of
 > demand for a (my)SQL backend to ht://Dig, as well as parsing
 > AltaVista-style queries. However, I've been up to my eyeballs in getting
 > the database format changes ready and the new Transport code.

 I must confess that the first two contributions I envision are

 1) Switch to automake + libtool
 2) Use an SQL backend (the rough idea is to encapsulate the
    backend-specific things in a shared lib module and implement a
    DBI-like interface).

 I was pleased to see you already had that in the wish list. 
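
 To make the DBI-like idea more concrete, here is a minimal sketch of
what such an abstraction could look like in C++. All class and method
names here are hypothetical; none of this is existing ht://Dig, mifluz
or webbase code, it is only meant to show the shape of the interface:

    // Abstract backend interface, roughly modelled after Perl's DBI.
    // A MySQL module or a plain Berkeley DB module would each provide
    // a concrete implementation in its own shared lib.
    class DbBackend {
    public:
        virtual ~DbBackend() {}
        virtual int  Connect(const char *dsn) = 0;   // e.g. "mysql:host=...;db=htdig"
        virtual int  Execute(const char *query) = 0; // INSERT/UPDATE/DELETE
        virtual int  Select(const char *query) = 0;  // start a result set
        virtual bool FetchRow(char **columns, int ncolumns) = 0;
        virtual void Disconnect() = 0;
    };

    // The crawler and the search front end would only ever see a
    // DbBackend*, obtained from a factory that loads the shared lib
    // module named in the configuration file.
    DbBackend *OpenBackend(const char *module_name);

 The point is that htdig itself would never contain backend-specific
code; switching from Berkeley DB to MySQL would just mean loading a
different module.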

 > I have no doubt that ht://Dig can handle millions of documents. There
 > are several sites in that ballpark, plus several more around 500,000+
 > documents. There are obvious problems with the size of the databases
 > (many OS limit files to 2GB), but this is greatly eased in the 3.2
 > codebase.

 Disclaimer: I may say stupid things because I haven't looked at the code
carefully. It seems to me that a few factors effectively prevent a
large-scale crawler from being maintained:
      . The list of starting-point URLs is kept in the configuration file.
        Our search engine has 150 000 starting-point URLs; that is hard
        to manage in a configuration file.
      . When the crawler updates URLs it does a network access for every
        one of them. Say I have 10 million URLs: this is not really what
        I want. What I want is that a URL successfully fetched is not
        re-checked for a week (configurable). Generally speaking, I want
        to specify update strategies that depend on the URL status
        (loaded, not modified, not found). I even want to be able to
        specify a different update strategy for every site, if
        appropriate (daily for newspapers, monthly for archives, etc.);
        a small sketch of such a policy follows below.
  Of course this (and many other things) depends on having a real
database in the back-end, not just a hash table.
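
 To illustrate the kind of update strategy I mean, here is a minimal
sketch in C++. The structure names, status codes and intervals are made
up for the example; this is not existing crawler code, just the decision
I would like the crawler to make from the database instead of hitting
the network for every known URL:

    #include <ctime>

    // Status of a URL after its last crawl.
    enum UrlStatus { LOADED, NOT_MODIFIED, NOT_FOUND };

    // Per-site revisit intervals, in seconds (e.g. one day for a
    // newspaper site, one month for an archive site).
    struct SitePolicy {
        time_t loaded_interval;
        time_t not_modified_interval;
        time_t not_found_interval;
    };

    // Decide whether a URL is due for a new network access, using the
    // last visit time and status stored in the database.
    bool NeedsRefetch(UrlStatus status, time_t last_visit, time_t now,
                      const SitePolicy &policy)
    {
        time_t interval = policy.loaded_interval;
        switch (status) {
        case LOADED:       interval = policy.loaded_interval;       break;
        case NOT_MODIFIED: interval = policy.not_modified_interval; break;
        case NOT_FOUND:    interval = policy.not_found_interval;    break;
        }
        return now - last_visit >= interval;
    }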

 > Fortunately, I certainly don't see ht://Dig going the way of isearch or
 > freewais--it was a bit touch-and-go last year before Andrew opened up
 > the CVS tree and I took over. None of us want to see that repeated. ;-)

 Thank you for the information.

 > I'm sure that's true. Right now, I'd prefer to work on it part-time,
 > though I often accept contract jobs for improving ht://Dig. Personally,
 > I'd prefer to focus on the maintainer aspects than the developer since I
 > don't consider myself a very outstanding coder. That's just my current
 > personal preference...

 You mean you would refuse a $60 000/year proposal ?-) Assuming the
company hires you to "continue working on ht://dig", does not assign a
project manager to you and does not set deadlines. The only thing the
company does is give you a salary to make sure ht://dig will not suffer
because at some point you'll find a well-paid job that eats all your
energy. I strongly believe a project like ht://dig needs at least two or
three full-time, motivated computer geeks.

-- 
                Loic Dachary

                ECILA
                100 av. du Gal Leclerc
                93500 Pantin - France
                Tel: 33 1 56 96 09 80, Fax: 33 1 56 96 09 61
                e-mail: [EMAIL PROTECTED] URL: http://www.senga.org/
