Re: [sqlite] FTS2 suggestion

Russell Leighton Thu, 23 Aug 2007 18:54:49 -0700

Could fts3 (the next fts) have the option to override the default 'match' function with one passed in (similar to the tokenizer)?

The reason I ask is that the fts table could then be used as a smart index when the tokenizer is something like bigram, trigram, etc. and the 'match' function computes a similarity metric and returns the row if it is above a threshold.
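
To make the idea concrete, here is a minimal sketch of a trigram 'match' in Python. It is not an fts API: the names trigrams, similarity and fuzzy_match are hypothetical, the padding (two spaces before each word, one after) follows the pg_trgm README, and the Jaccard-style ratio and the 0.3 threshold are assumptions chosen for illustration.

```python
def trigrams(text):
    """Return the set of character trigrams of each word in text.

    Each word is lowercased and padded with two leading spaces and one
    trailing space, as described in the pg_trgm README.
    """
    grams = set()
    for word in text.lower().split():
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (assumed metric)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def fuzzy_match(query, candidate, threshold=0.3):
    """Hypothetical 'match': true when similarity reaches the threshold."""
    return similarity(query, candidate) >= threshold
```

With this sketch a misspelling such as "sqlit" still matches "sqlite", while an unrelated word such as "postgres" shares no trigrams with it and is rejected.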

Postgres does this when you declare an index of type trigram, see:

        http://www.sai.msu.su/~megera/postgres/gist/pg_trgm/README.pg_trgm

Since SQLite does not allow 'plug-in' indexes, the idea would be to create an fts3 table with a key back to the main table and the string column you want to index.

Indexing becomes a join through the fts3 table.
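
A rough sketch of that join, using Python's sqlite3 module. Since the overridable 'match' does not exist, an ordinary table of trigrams (docs_trgm) stands in for the proposed fts3 table; the table names, the helper functions and the min_shared cutoff are all assumptions made up for this example.

```python
import sqlite3


def trigrams(text):
    """Set of padded, lowercased character trigrams (pg_trgm-style padding)."""
    grams = set()
    for word in text.lower().split():
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT);
    -- Stand-in for the fts3 table: one row per (trigram, main-table key).
    CREATE TABLE docs_trgm (gram TEXT, doc_id INTEGER REFERENCES docs(id));
    CREATE INDEX docs_trgm_gram ON docs_trgm (gram);
""")


def add_doc(title):
    """Insert a row into the main table and index its trigrams."""
    cur = conn.execute("INSERT INTO docs (title) VALUES (?)", (title,))
    conn.executemany(
        "INSERT INTO docs_trgm (gram, doc_id) VALUES (?, ?)",
        [(gram, cur.lastrowid) for gram in trigrams(title)],
    )


def search(query, min_shared=3):
    """Join through the trigram table; keep rows sharing enough trigrams."""
    grams = sorted(trigrams(query))
    if not grams:
        return []
    placeholders = ",".join("?" * len(grams))
    sql = f"""
        SELECT d.id, d.title, COUNT(*) AS shared
        FROM docs_trgm AS t
        JOIN docs AS d ON d.id = t.doc_id
        WHERE t.gram IN ({placeholders})
        GROUP BY d.id
        HAVING COUNT(*) >= ?
        ORDER BY COUNT(*) DESC, d.id
    """
    return conn.execute(sql, grams + [min_shared]).fetchall()


for title in ("full text search", "trigram index", "bloom filter"):
    add_doc(title)

# A misspelled query still finds the row through shared trigrams.
print(search("serch"))
```

Counting shared trigrams is a cruder filter than a similarity ratio; a real 'match' function would presumably rescore the candidates that survive the join.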

You would probably want to allow the user to pass args to the 'match' function so a threshold could be set to non-default values, and maybe to tweak matching options specific to the match and tokenization.
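
As an illustration of passing args, the sketch below registers an ordinary SQL function through Python's sqlite3 create_function with a variable argument count, so the threshold can be omitted or supplied per query. The function name similar, the table and the 0.3 default are hypothetical, and a scalar function like this scans every row rather than using an index, so it only shows the calling convention.

```python
import sqlite3


def trigrams(text):
    """Set of padded, lowercased character trigrams (pg_trgm-style padding)."""
    grams = set()
    for word in text.lower().split():
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similar(candidate, query, threshold=0.3):
    """Return 1 when the trigram similarity reaches threshold, else 0."""
    if candidate is None or query is None:
        return 0
    tc, tq = trigrams(candidate), trigrams(query)
    if not tc or not tq:
        return 0
    return int(len(tc & tq) / len(tc | tq) >= threshold)


conn = sqlite3.connect(":memory:")
# narg=-1 lets SQL callers pass two arguments (default threshold) or three.
conn.create_function("similar", -1, similar)
conn.execute("CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO names (name) VALUES (?)",
    [("sqlite",), ("sqlit3",), ("postgres",)],
)


def find(query, threshold=None):
    """Run the match with the default threshold or an explicit one."""
    if threshold is None:
        sql = "SELECT name FROM names WHERE similar(name, ?) ORDER BY id"
        params = (query,)
    else:
        sql = "SELECT name FROM names WHERE similar(name, ?, ?) ORDER BY id"
        params = (query, threshold)
    return [row[0] for row in conn.execute(sql, params)]


print(find("sqlite"))        # default threshold
print(find("sqlite", 0.9))   # stricter, per-query threshold
```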

Thoughts?


On Aug 23, 2007, at 4:56 PM, Scott Hess wrote:

On 8/20/07, Cesar D. Rodas <[EMAIL PROTECTED]> wrote:

As far as I know (I can be wrong), SQLite Full Text Search only matches
whole words, right? It could not be
And also no FT extension to a db (as far as I know) is misspelling-tolerant,


Yes, fts is matching exactly.  There is some primitive support for
English stemming using the Porter stemmer, but, honestly, it's not
well-exercised.

And I've found this paper that talks about *Using Superimposed Coding Of
N-Gram Lists For Efficient Inexact Matching*

http://citeseer.ist.psu.edu/cache/papers/cs/22812/http:zSzzSzwww.novodynamics.comzSztrenklezSzpaperszSzatc92v.pdf/william92using.pdf

I read it, and it does not look hard to implement. It costs extra
storage space, but I think the benefits outweigh that.
Following this paper, there could also be a way to match fragments of
words... what do you think of it?


It's an interesting paper, and I must say that anything which involves
Bloom Filters automatically draws my attention :-).

While I think spelling-suggestion might be valuable for fts in the
longer term, I'm not very enthusiastic about this particular model.
It seems much more useful in the standard indexing model of building
the index, manually tweaking it, and then doing a ton of queries
against it.  fts is really fairly constrained, because many use-cases
are more along the lines of update the index quite a bit, and query it
only a few times.

Also, I think the concepts in the paper might have very significant
problems handling Unicode, because the bit vectors will get so very
large.  I may be wrong; sometimes the overlapping-vector approach can
have surprising relevance depending on the frequency distribution of
the things in the vector.  It would need some experimentation to
figure that out.

Certainly something to bookmark, though.

Thanks,
scott

-----------------------------------------------------------------------------

To unsubscribe, send email to [EMAIL PROTECTED]

-----------------------------------------------------------------------------


