Hi,
I was referred to your remarks on ht://Dig in the context of picking
a search engine for the Debian pages. I don't know which version you
were looking at when you made them, but for one thing, it does
support local indexing (and has for over a year).
One suggestion is that if you're trying it out, you'll want to use
the unstable package or compile it yourself. The "stable" version is
quite buggy in comparison with more recent versions. If you want to
see details about what's changed in recent versions, see
<http://www.htdig.org/RELEASE.html>
I'll also reply to your requirements:
>- Free (as in DFSG)
>- Able to handle large data sets (> 1 GB)
>- Able to keep separate indexes and merge them (so we don't have to reindex
> previous months mail archives, but can simply merge those with the current
> month). Merging indexes should be fairly efficient (in most cases where
> merging is implemented, it is not).
>- Able to search on specific parts of the data. For example, searching on
> subject or sender in mail. I don't care how this is implemented (through
> separate indices or through use of regex, e.g.
> /^Subject: .*How to get rich/) as long as it is possible.
>- Able to index files locally, i.e. without going through a web server.
>- Able to search using regex (optional). Next down would be searching for
> simple phrases. At a minimum, the ability to match arbitrary word endings
> (the equivalent of /^keyword\w*/).
>
>Here is what I've looked at so far:
>htdig - can't index locally, too slow
- Free: GPL'ed.
- Large data sets: Well, we think more along the lines of number of URLs.
But yes, there are people using ht://Dig with over a million URLs and
2GB+ databases. Several developers are working on transparent
compression that will help considerably.
- Separate Indexes w/ Merging: Yup, though this was added more
recently than your latest stable package.
- Search Specific Parts: You could do this with separate indexes, but
not easily. The latest devel code at least makes this feasible,
though the search side of things isn't implemented.
- Local indexing: Yes, and it reverts to HTTP as needed.
- Regex/Phrases: Implemented in the development tree. However, you
mention word endings, which it already does through various fuzzy
matching algorithms including "endings," "substring," and "prefix."
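To make the difference between prefix-style and substring-style matching concrete, here is a minimal Python sketch. It is only an illustration and not how ht://Dig implements its fuzzy algorithms; the function names and the word list are made up for the example.

```python
import re

# Hypothetical word list standing in for the words stored in an index.
WORDS = ["index", "indexing", "indexes", "reindex", "merge"]


def prefix_matches(keyword, words):
    """Return words that begin with keyword (arbitrary word endings)."""
    return [w for w in words if w.startswith(keyword)]


def substring_matches(keyword, words):
    """Return words that contain keyword anywhere."""
    return [w for w in words if keyword in w]


def regex_prefix_matches(keyword, words):
    """Same idea as prefix_matches, written as the regex /^keyword\\w*/."""
    pattern = re.compile(r"^" + re.escape(keyword) + r"\w*")
    return [w for w in words if pattern.match(w)]


if __name__ == "__main__":
    print(prefix_matches("index", WORDS))
    print(substring_matches("index", WORDS))
    print(regex_prefix_matches("index", WORDS))
```

Prefix matching covers the /^keyword\w*/ case from your list, while substring matching is looser and also picks up words like "reindex".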
As for speed, I'd obviously be interested in any performance
measurements you can provide. It certainly doesn't scale as well as
we'd like, but the more feedback we get on this, the better! And of
course, we don't turn down contributions in any form. :-)
You'll obviously make your own decision, but I'll also suggest that
you talk to the GNU webmasters and the folks running Linux.com, who
both chose ht://Dig.
Cheers,
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/