On Wed, 4 Dec 2002 19:12, Gabriele Bartolini wrote:

> IMHO, the ultimate goal for a search process is to get a
> set of documents satisfying a semantic criterion or,
> better, a context criterion.

The ultimate would be the singular value decomposition (SVD) 
approach that someone (Geoff?) was suggesting using for a 
"similar documents" search.  I'd really like to see this in HtDig 
eventually.  It moves away from indexing on words at all, 
and instead indexes on an abstract notion of "how often are 
the search words used in similar contexts to the words of 
the target document?"
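
To make the idea concrete, here is a rough sketch in Python (not HtDig 
code; the sample documents and all function names are made up for 
illustration).  It builds a term-document matrix, finds the top 
singular directions by power iteration with deflation on A^T A (a 
simple stand-in for a real SVD routine), and compares documents in 
that reduced space.  This is the technique usually called latent 
semantic indexing.

```python
import math

# Hypothetical toy corpus: documents 0 and 2 share no words at all,
# but both share a word with document 1.
DOCS = ["swim pool", "pool water", "water swimming",
        "index search", "search query"]

def term_doc_matrix(docs):
    """Rows are terms, columns are documents, entries are raw counts."""
    vocab = sorted({w for d in docs for w in d.split()})
    return [[d.split().count(t) for d in docs] for t in vocab]

def gram(a):
    """A^T A: entry (i, j) is the dot product of document columns i and j."""
    n = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n)]
            for i in range(n)]

def top_eigenpairs(g, k, iters=500):
    """Top-k eigenpairs of a symmetric matrix by power iteration + deflation."""
    n = len(g)
    g = [row[:] for row in g]          # work on a copy
    pairs = []
    for _ in range(k):
        v = [1.0] * n
        for _ in range(iters):
            w = [sum(g[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(g[i][j] * v[j] for j in range(n))
                  for i in range(n))
        pairs.append((lam, v))
        for i in range(n):             # remove the component just found
            for j in range(n):
                g[i][j] -= lam * v[i] * v[j]
    return pairs

def latent_coords(docs, k):
    """Coordinates of each document in the k-dimensional latent space."""
    pairs = top_eigenpairs(gram(term_doc_matrix(docs)), k)
    return [[math.sqrt(max(lam, 0.0)) * v[d] for lam, v in pairs]
            for d in range(len(docs))]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

if __name__ == "__main__":
    coords = latent_coords(DOCS, 2)
    print("doc 0 vs doc 2:", cosine(coords[0], coords[2]))
    print("doc 0 vs doc 3:", cosine(coords[0], coords[3]))
```

In this toy corpus documents 0 and 2 have no word in common, so a plain 
word index would never relate them, yet they land close together in 
the reduced space because both overlap with document 1.  A real 
implementation would use a sparse SVD library and some term weighting 
rather than raw counts.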

> ... Italian and Latin languages, but they
> are for sure more complex, having different affix rules
> and many more tenses.

Good point.  Am I also correct in believing that some 
languages like German have a lot of changes to the stems 
themselves ("schwimmen, schwamm, geschwommen", "trinken, 
trank, getrunken")?  Is there an approach that can handle 
that much generality?

> As Geoff suggested, we could implement a
> different fuzzy algorithm for the 'Porter stemming' which
> builds a new index (a stemmed one).

Yes, it would be good to have a fuzzy stemming algorithm 
which doesn't simply return a query with (variant1 OR 
variant2 OR ...), but actually searches a stemmed index.  
That would be more efficient when a word has lots of 
different forms.
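
A small sketch of the difference, in Python.  Note the assumptions: 
`toy_stem` is a deliberately crude suffix stripper and NOT the Porter 
algorithm, and the index layout and function names are invented for 
illustration, not taken from htdig.

```python
from collections import defaultdict

SUFFIXES = ("ing", "ed", "es", "s")   # toy rules only, not Porter's

def toy_stem(word):
    """Strip one common English suffix, keeping at least 3 letters."""
    word = word.lower()
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

def build_index(docs, key=lambda w: w.lower()):
    """Map key(word) -> set of document ids containing that word."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for w in text.split():
            index[key(w)].add(doc_id)
    return index

def search_by_expansion(word_index, variants):
    """The (variant1 OR variant2 OR ...) approach: one lookup per variant."""
    hits = set()
    for v in variants:
        hits |= word_index.get(v, set())
    return hits

def search_stemmed(stem_index, word):
    """The stemmed-index approach: a single lookup on the stem."""
    return stem_index.get(toy_stem(word), set())

if __name__ == "__main__":
    docs = {1: "indexing words", 2: "indexed pages",
            3: "index files", 4: "search engine"}
    word_index = build_index(docs)
    stem_index = build_index(docs, key=toy_stem)
    print(search_by_expansion(word_index, ["index", "indexing", "indexed"]))
    print(search_stemmed(stem_index, "indexes"))
```

The expansion search costs one lookup per variant and has to know the 
variants in advance, whereas the stemmed index is consulted once no 
matter how many surface forms exist.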

> > word-level indexing, to give (much) smaller inverted
> > files if people don't need phrase searching.
>
> I guess customisation is our goal. In a retrieval phase,
> we'd want to store almost *anything* we can, then maybe
> with different fuzzy algorithms build alternative indexes
> (smaller or bigger, depending on users' settings).

Yes, a document-level inverted file could be generated from 
the word-level one after the whole dig.  I don't know much 
about htdig's fuzzy mechanism yet; is it possible to delete 
the main inverted file and just rely on a "fuzzy" one?  If 
so, the only other disadvantages would be speed and the 
amount of temporary space required (RAM and disk space).
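
As a sketch of that post-processing step in Python (the in-memory 
layout here is an assumption for illustration and is not htdig's 
on-disk format): the word-level file stores (document, position) 
pairs, and the document-level file is obtained by dropping the 
positions and removing duplicates.

```python
def build_word_level(docs):
    """word -> list of (doc_id, position) postings, as produced by the dig."""
    index = {}
    for doc_id, text in docs.items():
        for pos, w in enumerate(text.lower().split()):
            index.setdefault(w, []).append((doc_id, pos))
    return index

def to_document_level(word_level):
    """Collapse word-level postings to word -> sorted list of doc ids."""
    return {w: sorted({doc_id for doc_id, _ in postings})
            for w, postings in word_level.items()}

def phrase_search(word_level, first, second):
    """Two-word phrase search; this needs positions, so only the
    word-level index can answer it."""
    second_postings = set(word_level.get(second, []))
    return sorted({d for d, p in word_level.get(first, [])
                   if (d, p + 1) in second_postings})

def posting_count(index):
    return sum(len(p) for p in index.values())

if __name__ == "__main__":
    docs = {1: "inverted file for phrase search",
            2: "file inverted",
            3: "inverted file inverted file"}
    word_level = build_word_level(docs)
    doc_level = to_document_level(word_level)
    print(doc_level["inverted"])
    print(phrase_search(word_level, "inverted", "file"))
    print(posting_count(word_level), posting_count(doc_level))
```

The document-level index has fewer postings whenever a word repeats 
within a document, and each posting is smaller because it carries no 
position, but phrase searching is lost once the word-level file is 
deleted.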

Cheers,
Lachlan

-- 
Lachlan Andrew  Phone: +613 8344-3816 Fax: +613 8344-6678
Dept of Electrical and Electronic Engg
University of Melbourne, Victoria, 3010  AUSTRALIA
CRICOS Provider Code 00116K


_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
