This is a refinement of FASD that proposes a simpler form of
inverted index.  It is compact and fast, and is effective for
Asian languages.  I will only briefly sketch out the idea here,
saving the many details for further discussion.

Conceptually, the indexing scheme is as follows:

1) A document (or abstract, description, title ...) to be indexed
   is converted to 32-bit Unicode.

2) All n-grams of length 1 through N are extracted from the Unicode
   text.  I have found N=4 to be a good choice for English, and N=3
   to be a good choice for Japanese.

3) A bit vector is formed by taking each n-gram to represent an
   integer, and setting the corresponding bit on.

4) The bit vector is compressed by some scheme which is compact and
   easily uncompressed.  This bit vector is the document surrogate
   used for retrieval.

Conceptually, the retrieval scheme is as follows:

1) A full-text query is converted to Unicode.

2) All n-grams of length 1 through N are extracted from the Unicode
   text (N need not be the same as for indexing).

3) A bit vector is formed of all n-grams in the query, just as for
   the document surrogate.

4) A compressed form of the bit vector is used as a FASD retrieval
   key. The similarity function is the number of bits that are on in
   the bit-wise AND of the query vector and a document vector.

Many further refinements are possible, such as stop-word pruning,
searches for misspelled words, multiple-vector Boolean queries, and
weighting of search words.  I think that as much as possible these
features should be provided by the client, rather than the many
servers.

I prefer n-gram approaches over word-based approachs because many
languages do not put spaces between words, making parsing of documents
into words very difficult.  Also, searches for phrases are as easy as
searches for words, with no need to store word positions in the index.

The main disadvantage of the n-gram approach is that a search will
retrieve a false positive whenever all of the n-grams within a query
word or phrase appear in a document, but not the word or phrase itself.
In my experience this is not a major problem, and if desired retrieved
documents can be scanned by the client to eliminate false positives.

I have found the use of compressed bit vectors as document surrogates
to be a fast and compact scheme, requiring on average two bits in the
compressed surrogate for each unique word in a document.

I have no references at hand to back up my opinions, but I do have
some twenty years of experience building full-text search systems,
and I do my best to keep up with the relevant academic literature.

   


_______________________________________________
Tech mailing list
[EMAIL PROTECTED]
http://hawk.freenetproject.org/cgi-bin/mailman/listinfo/tech

Reply via email to