[sqlite] Some FTS5 guidance

Scott Hess Fri, 8 Jan 2016 12:43:15 -0800

With fts4 you could search for matching terms in an fts4aux table, then use
those to construct a query against the original table.  You'd have a full
scan of the fts index, but you'd not have to do a full table scan of the
primary data.  Unfortunately if there were a large number of hits in the
index scan, then it would be cheaper to just do the full table scan and
skip the index scan.  I don't know if there's a similar thing for fts5 at
this time.


This wouldn't be as efficient as something more suited to substring matches
(an N-gram index, maybe?), but I haven't heard anyone talking about writing
a virtual table to do that.

-scott


On Fri, Jan 8, 2016 at 11:54 AM, Charles Leifer <coleifer at gmail.com> wrote:

> You can create a custom tokenizer as well then use the standard search
> APIs. I imagine that functionality would work well in this case:
> https://sqlite.org/fts5.html#section_7
>
> On Thu, Jan 7, 2016 at 3:59 PM, Stadin, Benjamin <
> Benjamin.Stadin at heidelberg-mobil.com> wrote:
>
> > One such algorithm would be a (generalized) Ukkonnen suffix tree (
> > https://en.m.wikipedia.org/wiki/Ukkonen%27s_algorithm).
> > It allows you to search efficiently for substrings.
> > It would be possible to do some match weigthing based on match distance
> > within words. But a general solution for a database is probably not
> trivial
> > to implement.
> >
> > Ben
> >
> > Von meinem iPad gesendet
> >
> > > Am 07.01.2016 um 21:46 schrieb Matthias-Christian Ott <ott at mirix.org>:
> > >
> > >> On 2016-01-07 19:31, Mario M. Westphal wrote:
> > >> I hence wonder if this problem has been tackled already and if there
> is
> > a
> > >> "standard" solution.
> > >
> > > If I understand you correctly, it seems that you are looking for a
> > > compound splitting or decompounding algorithm. Unfortunately there is
> > > not a "standard solution" for this. There are many languages in the
> > > world and for some usable compound splitting algorithms exist. There
> are
> > > also attempts to create statistical universal algorithms.
> > >
> > > As you said, for English a simple sub-string search might suffice but
> > > for other languages it more complex. I assume that you speak German. If
> > > you have a document that contains the term "Verkehrsleitsystem" and
> your
> > > search query is "Verkehr leiten", it's reasonable to assume that the
> > > document is relevant to the search query. Unfortunately a sub-string
> > > search could not find the document. Other languages are even more
> > > difficult (a textbook on linguistics will explain this better than I
> > can).
> > >
> > > Even if you have such algorithm, it's not trivial to score the results
> > > and there are more aspects to consider to create a simple search
> > > algorithm. For example, in English you will also have to do some
> > > analysis of the phrase structure to identify open compounds.
> > >
> > > Perhaps it helps to mention the languages you are interested in and the
> > > application you have in mind to evaluate whether the SQLite FTS5 could
> > > meet your requirements.
> > > _______________________________________________
> > > sqlite-users mailing list
> > > sqlite-users at mailinglists.sqlite.org
> > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > _______________________________________________
> > sqlite-users mailing list
> > sqlite-users at mailinglists.sqlite.org
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> >
> _______________________________________________
> sqlite-users mailing list
> sqlite-users at mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>

[sqlite] Some FTS5 guidance

Reply via email to