We have had $subj for some time now.
Take a look at http://charon.fi.muni.cz/htdig/
or test it at
http://charon.fi.muni.cz/htdig/search.html
(we've indexed the "Technical part" of our faculty's web, which is partly
Czech and partly English).
Unfortunately, it's all in Czech so far.
For some languages, indexing by word roots makes much more sense than
indexing by whole words; this is absolutely true for Czech.
So we experimented with lemma (commercial) and ajka (GNU, almost finished)
lemmatization software to get word roots. Finally, we took out the part of
ispell that provides access to the hash and used that, because it can also be
used with other languages (although it knows far fewer word forms than the
other two).
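
To show what I mean by indexing over word roots, here is a minimal sketch in
Python. It is not our actual code: the LEMMAS table is a hypothetical, tiny
stand-in for a real lemmatizer lookup (such as the ispell hash), and the names
root_of, build_index and search are made up for this example.

```python
from collections import defaultdict

# Hypothetical stand-in for a lemmatizer lookup (e.g. the ispell hash).
# Maps an inflected Czech word form to its root; unknown words map to themselves.
LEMMAS = {
    "hradu": "hrad",
    "hradem": "hrad",
    "hrady": "hrad",
    "fakulty": "fakulta",
    "fakultě": "fakulta",
}


def root_of(word):
    """Return the root of a word, falling back to the lowercased word."""
    w = word.lower()
    return LEMMAS.get(w, w)


def build_index(docs):
    """Build an inverted index keyed by word root instead of word form."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[root_of(word)].add(doc_id)
    return index


def search(index, query):
    """Reduce the query to its root as well, then look it up."""
    return sorted(index.get(root_of(query), set()))


DOCS = {1: "hrad", 2: "pod hradem", 3: "web fakulty"}
INDEX = build_index(DOCS)
```

With this, a query for "hradu" finds documents 1 and 2 although neither
contains that exact form, which is the whole point for a language like Czech.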
At present we're trying to index our faculty's web, but it seems that
the algorithm htdig uses to create the inverted file is too naive:
it looks to me like it's trying to apply Unix 'sort' to a 1GB file...
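
For comparison, one common way to build an inverted file without sorting
everything in a single pass is to sort small runs and then merge them. The
sketch below is only an illustration of that idea, not what htdig does and not
something we have implemented; the names sorted_runs and build_inverted are
hypothetical, and the runs are kept as in-memory lists where a real indexer
would write them to temporary files.

```python
import heapq
from itertools import groupby


def sorted_runs(postings, run_size):
    """Cut the (word, doc_id) stream into bounded-size runs and sort each one."""
    run = []
    for posting in postings:
        run.append(posting)
        if len(run) >= run_size:
            yield sorted(run)
            run = []
    if run:
        yield sorted(run)


def build_inverted(postings, run_size=2):
    """Merge the sorted runs and group the result into word -> doc_id lists."""
    # Assumption: runs are in-memory lists here; on disk they would be files.
    runs = list(sorted_runs(postings, run_size))
    merged = heapq.merge(*runs)  # lazy k-way merge of already sorted runs
    return {
        word: sorted({doc_id for _, doc_id in group})
        for word, group in groupby(merged, key=lambda p: p[0])
    }


POSTINGS = [("web", 2), ("hrad", 1), ("web", 1), ("hrad", 3), ("index", 2)]
INVERTED = build_inverted(POSTINGS)
```

Only one run has to be sorted in memory at a time, and the merge step reads
each run sequentially.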
If anyone is interested in the details of what we've done, please mail me.
--
Martin Povolný, [EMAIL PROTECTED], http://www.fi.muni.cz/~xpovolny
Výstavní 24, 603 00, Brno, Czech Republic
tel. home: 0420-5-43246090, mobile: 00420-603-913869
Key fingerprint = 0C 2C F0 6F D0 3E EC 39 AC 58 99 E1 72 FB 12 5C
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.