Geoff Hutchison <[EMAIL PROTECTED]> writes:
> First off, Tom? Where did you hear it wasn't fast enough?
Before installing Ht://Dig I spent about 4 hours reviewing the "word
on the street" comments on Usenet and another 4 hours reading web
sites that reviewed search engines, plus the documentation for a few
of them.
Here's one message remarking on the slowness:
From: [EMAIL PROTECTED] (James A. Treacy)
Subject: Re: Search Engine
Date: 23 Sep 1999 00:00:00 GMT
Newsgroups: linux.debian.www
Here is what I've looked at so far:
htdig - can't index locally, too slow
mg - still evaluating
namazu - haven't really looked at
psearch - variant of isearch. promising, but still under
development
swish++ - can't merge separate indices. great for straight
html though
glimpse - non-free
And I've been meaning to ask you guys about the "can't index locally"
issue. Recently while browsing through the config directives, I ran
across:
http://www.htdig.org/attrs.html#local_urls
local_urls
...
description:
Set this to tell ht://Dig to access certain URLs
through local filesystems. At first ht://Dig will try to
access pages with URLs matching the patterns
through the filesystems specified. If it cannot find the
file, it will try the URL through HTTP instead. Note
the example--the equal sign and the final slashes in
both the URL and the directory path are critical.
example:
local_urls: http://www.foo.com/=/usr/www/htdocs/
Does this work? (I haven't tried it.) If so, why isn't it mentioned
in some intro document, and why doesn't configure prompt the user to
set it up? My guess is that 90% or more of Ht://Dig installations occur
on the machine containing the HTML to be indexed, so if this works as
an efficiency improvement, I'd think most people would want to use it.
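For what it's worth, here's a minimal sketch of what such a setup might
look like in a config file (the hostname and paths are hypothetical,
and I haven't tested this):

```
# Hypothetical htdig.conf fragment -- untested sketch.
# Per the docs, the equal sign and the trailing slashes on
# both the URL pattern and the directory path are critical.
start_url:   http://www.example.com/
local_urls:  http://www.example.com/=/usr/www/htdocs/
```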
And for this related directive:
http://www.htdig.org/attrs.html#local_default_doc
local_default_doc
...
default:
index.html
description:
Set this to the default document in a directory used
by the server. This is used for local filesystem access
to translate URLs like http://foo.com/ into something
like /home/foo.com/index.html
example:
local_default_doc: default.html
I'd suggest making this a string list, the same as remove_default_doc
is. The same rationale applies.
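Something along these lines, mirroring the remove_default_doc syntax
(the file names are just examples):

```
# Hypothetical syntax if local_default_doc accepted a string list:
local_default_doc: index.html default.html Default.htm
```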
Ideally, configure would ask for the path to an Apache config file and
extract this information itself. (Though understandably that's not
trivial to do.)
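As a rough sketch of the idea (a Python illustration, not part of
Ht://Dig; it only picks up top-level ServerName and DocumentRoot lines
and ignores VirtualHost blocks, Alias directives, and everything else
a real Apache config can contain):

```python
import re

def local_urls_from_apache(conf_text):
    """Scan Apache config text for ServerName and DocumentRoot
    and build an ht://Dig local_urls line from them. A rough
    sketch only; real configs need much more careful parsing."""
    server = None
    root = None
    for line in conf_text.splitlines():
        line = line.strip()
        m = re.match(r'ServerName\s+(\S+)', line, re.IGNORECASE)
        if m:
            server = m.group(1)
        m = re.match(r'DocumentRoot\s+"?([^"\s]+)"?', line, re.IGNORECASE)
        if m:
            root = m.group(1)
    if server and root:
        # Trailing slashes on both sides are required by local_urls.
        return "local_urls: http://%s/=%s/" % (server, root.rstrip("/"))
    return None
```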
> We may want to start splitting into different threads.
That may make sense. It is also possible that you might just want to
concede that Ht://Dig isn't going to be optimized for speed. Instead
concentrate on other attributes, like ease of installation and
administration. Define your market and stick with it, so to speak.
The site I applied Ht://Dig to was under 10 MB of HTML and PDF files
and it was indexed in a matter of a few minutes via HTTP. Running once
a day at 3 AM, that's insignificant.
If a threaded version required more effort to compile or administer,
it wouldn't be worth it for such a simple application. Most web sites
probably fall within Ht://Dig's performance range.
> As far as the indexer, I'd guess the main slowdown comes in database
> operations. String optimizations wouldn't hurt, but database lookups
> kill us, esp on large databases.
Database size is the other "word on the street" complaint, but I'm
sure you're well aware of that, as even the Ht://Dig documentation
reflects this.
By the way, when did you switch from GDBM to Berkeley DB2? I tried
using one of the user-contributed Perl scripts that was set up for GDBM
and ended up fiddling with it for an hour. Perl's GDBM_File module is
notoriously bad at reporting GDBM library errors, so I finally threw
together a test program in C and got back "bad magic number." Argh.
Further examination of Ht://Dig and more digging in the contrib
directory showed that you are using Berkeley DB2 now. (And lucky me,
the target system's Berkeley DB2 library is too old to support Perl's
Berkeley DB2 interface...cascading upgrade.)
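For anyone else who hits this, a quick way to tell which library wrote
a database file is to look at the magic number directly. A Python
sketch follows; the magic values are the ones from the gdbm and
Berkeley DB headers as I understand them, and the offset-12 location
for the DB 2.x metadata page is an assumption worth checking against
your db.h:

```python
import struct

# Assumed magic numbers (from gdbm.h and Berkeley DB's db.h):
GDBM_MAGIC = 0x13579ACE        # gdbm files, magic at offset 0
DB_BTREE_MAGIC = 0x00053162    # Berkeley DB btree
DB_HASH_MAGIC = 0x00061561     # Berkeley DB hash

def identify_db(path):
    """Guess a database file's format from its magic number.
    Checks offset 0 (gdbm) and offset 12 (assumed location in a
    Berkeley DB 2.x metadata page), in both byte orders."""
    with open(path, "rb") as f:
        header = f.read(16)
    for offset in (0, 12):
        if len(header) < offset + 4:
            continue
        for order in ("<", ">"):
            (magic,) = struct.unpack_from(order + "I", header, offset)
            if magic == GDBM_MAGIC:
                return "GDBM"
            if magic == DB_BTREE_MAGIC:
                return "Berkeley DB (btree)"
            if magic == DB_HASH_MAGIC:
                return "Berkeley DB (hash)"
    return "unknown"
```

Had I run something like this first, the "bad magic number" mystery
would have taken minutes instead of an hour.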
If I can get things to work, I'll update those Perl scripts to
Berkeley DB2 and submit them to you guys. They appear to be pretty
close to the reporting functionality I asked about in an earlier
email, and I think they deserve mention in some of the introductory
documentation.
-Tom
--
Tom Metro
Venture Logic [EMAIL PROTECTED]
Newton, MA, USA
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED]
You'll receive a message confirming the unsubscription.