[sqlite] FTS5 stopwords

Dan Kennedy Mon, 14 Sep 2015 21:33:07 +0700

On 09/14/2015 09:13 PM, Abilio Marques wrote:
> Hi,
>
> I know I'm a newcomer into the SQLite project, but I'm excited about what
> FTS5 has to offer. To me it seems simple and powerful, and has some really
> nice ideas.
>
> Is it possible for me to contribute to the module, or is it too late for
> that?
>
> I would like to mention two new ideas I would offer to introduce. First, a
> customizable list of stopwords:
>
> https://en.wikipedia.org/wiki/Stop_words
>
> (I didn't find anything similar to that in the documentation, am I missing
> something?)
>
> I know I can add it via a custom tokenizer, but wouldn't it be useful to
> have it straight out of the box?
>
>
> Also, I would like to mention the usefulness of some statistics to create
> more advanced ranking formulas. Things like: the Longest Common Subsequence
> between query and document, number of unique matched keywords, etc. These
> and other values are really useful in applications where bm25 is not
> suitable or enough.


Hi,

From an FTS5 custom auxiliary function, there are two ways to find the 
token offset of every phrase match in the current document:

The xInstCount()/xInst() pair allows random access to an array of matches - 
i.e. give me the phrase number, column and token offset of the Nth match:

   https://www.sqlite.org/draft/fts5.html#xInstCount

And xPhraseFirst()/xPhraseNext() allow the user to iterate through the 
matches for a specific query phrase within the current document:

   https://www.sqlite.org/draft/fts5.html#xPhraseFirst

xPhraseFirst/xPhraseNext is faster, but xInstCount/xInst can be easier 
to use.
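
As a rough illustration of the kind of statistic that could be derived from 
the xInstCount()/xInst() data, here is a minimal sketch. It does not call the 
FTS5 API itself: the MatchInst struct is a hypothetical stand-in for the 
(phrase number, column, token offset) triple that xInst() reports, and the 
function names and MAX_PHRASES limit are made up for this example. A real 
auxiliary function would fill such an array by looping over xInst().

```c
#include <assert.h>

/* Hypothetical stand-in for one xInst() result (not an FTS5 type). */
typedef struct MatchInst {
  int iPhrase;   /* Query phrase number */
  int iCol;      /* Column the match occurs in */
  int iOff;      /* Token offset of the match within the column */
} MatchInst;

/* Arbitrary limit chosen for this sketch only. */
#define MAX_PHRASES 64

/* Count how many distinct query phrases have at least one match. */
int unique_phrases_matched(const MatchInst *aInst, int nInst, int nPhrase){
  unsigned char seen[MAX_PHRASES] = {0};
  int i;
  int n = 0;
  assert( nPhrase>=0 && nPhrase<=MAX_PHRASES );
  for(i=0; i<nInst; i++){
    int p = aInst[i].iPhrase;
    if( p>=0 && p<nPhrase && !seen[p] ){
      seen[p] = 1;
      n++;
    }
  }
  return n;
}

/* Smallest token offset of any match in column iCol, or -1 if none. */
int first_match_offset(const MatchInst *aInst, int nInst, int iCol){
  int i;
  int best = -1;
  for(i=0; i<nInst; i++){
    if( aInst[i].iCol==iCol && (best<0 || aInst[i].iOff<best) ){
      best = aInst[i].iOff;
    }
  }
  return best;
}
```

A ranking function could combine values like these with bm25, for example 
boosting rows where first_match_offset() is small.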

It should be possible to build the sorts of things you're talking about 
on top of one of those, no? The example matchinfo() implementation 
includes a routine that determines the longest common subsequence here:

   http://www.sqlite.org/src/artifact/e96be827aa8f5?ln=259-281
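
For readers who want a feel for such a statistic without reading the linked 
source, the sketch below shows one simplified reading of it: the longest run 
of consecutive query phrases that appear back to back in the same column. It 
is not a transcription of the linked matchinfo() code. MatchInst is again a 
hypothetical stand-in for what xInst() reports, and aPhraseSize is assumed to 
hold the token count of each phrase (the value xPhraseSize() would return).

```c
#include <assert.h>

/* Hypothetical stand-in for one xInst() result (not an FTS5 type). */
typedef struct MatchInst {
  int iPhrase;   /* Query phrase number */
  int iCol;      /* Column the match occurs in */
  int iOff;      /* Token offset of the match within the column */
} MatchInst;

/* Return the index of the instance equal to (iPhrase, iCol, iOff), or -1. */
static int find_inst(
  const MatchInst *aInst, int nInst, int iPhrase, int iCol, int iOff
){
  int i;
  for(i=0; i<nInst; i++){
    if( aInst[i].iPhrase==iPhrase
     && aInst[i].iCol==iCol
     && aInst[i].iOff==iOff
    ){
      return i;
    }
  }
  return -1;
}

/*
** Length of the longest chain in which phrase p is immediately followed,
** in the same column, by phrase p+1. Returns 0 if there are no matches.
** Quadratic in nInst, which is acceptable for a sketch.
*/
int longest_phrase_run(
  const MatchInst *aInst, int nInst, const int *aPhraseSize, int nPhrase
){
  int best = 0;
  int i;
  for(i=0; i<nInst; i++){
    int len = 1;
    int cur = i;
    assert( aInst[i].iPhrase>=0 && aInst[i].iPhrase<nPhrase );
    while( aInst[cur].iPhrase+1<nPhrase ){
      int p = aInst[cur].iPhrase;
      /* Phrase p+1 must start right after the last token of phrase p. */
      int next = find_inst(aInst, nInst, p+1, aInst[cur].iCol,
                           aInst[cur].iOff + aPhraseSize[p]);
      if( next<0 ) break;
      len++;
      cur = next;
    }
    if( len>best ) best = len;
  }
  return best;
}
```

The chain always moves to a higher phrase number, so the inner loop 
terminates after at most nPhrase steps.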

Feedback from anyone who actually tries to use this API would be much 
appreciated.

Dan.



>
> I come from using an engine called Sphinx Search (used on huge things like
> Craigslist), which offers such factors. Using them, they have defined
> rankers that mix bm25 with proximity, and some other they call
> SPH_RANK_SPH04, which includes a weighting boost for the result appearing
> at the beginning of the text field, and a bigger boost if it's an exact
> match:
>
> http://sphinxsearch.com/docs/latest/builtin-rankers.html
>
> The formulas (in sphinx higher is better) for them are:
> http://sphinxsearch.com/docs/latest/formulas-for-builtin-rankers.html
>
> And the list of supported factors is:
> http://sphinxsearch.com/docs/latest/ranking-factors.html.
>
> Of course having all of them would be overkill, but if you find them
> interesting, we can get the most useful ones, allowing people to build
> rankers to their own needs.
>
>
> Once again, you people are the experts and know if such ideas are feasible
> and where is the right place to include them, so please tell me your
> opinions.
