[sqlite] FTS5 stopwords

2015-09-14 Thread Dan Kennedy
On 09/14/2015 09:13 PM, Abilio Marques wrote:
> Hi,
>
> I know I'm a newcomer to the SQLite project, but I'm excited about what
> FTS5 has to offer. To me it seems simple and powerful, and has some really
> nice ideas.
>
> Is it possible for me to contribute to the module, or is it too late for
> that?
>
> I would like to propose two new ideas. First, a customizable list of
> stopwords:
>
> https://en.wikipedia.org/wiki/Stop_words
> (I didn't find anything similar to that in the documentation, am I missing
> something?)
>
> I know I can add it via a custom tokenizer, but wouldn't it be useful to
> have it straight out of the box?

Hi,

I think such a thing would be implemented using the custom tokenizer API 
even if it were shipped as part of FTS5, perhaps as a "wrapper tokenizer" 
similar to the built-in porter tokenizer.
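
For illustration, here is a minimal sketch of the filtering step such a wrapper might perform. It assumes the wrapper hands its own callback to the wrapped tokenizer and forwards only non-stopword tokens. The callback type mirrors the shape of the token callback passed to an FTS5 tokenizer's xTokenize(); StopwordCtx, isStopword(), stopwordFilterCb() and countTokenCb() are hypothetical names, and the create/delete methods and registration that a real tokenizer needs are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same shape as the token callback an FTS5 tokenizer's xTokenize() is
** handed: context pointer, flags, token text and length, and the byte
** offsets of the token within the input text. */
typedef int (*TokenCallback)(void *pCtx, int tflags, const char *pToken,
                             int nToken, int iStart, int iEnd);

/* Hypothetical wrapper context: the stopword list plus the downstream
** callback (and its context) that surviving tokens are forwarded to. */
typedef struct StopwordCtx {
  const char *const *azStop;   /* Array of stopwords, already case-folded */
  int nStop;                   /* Number of entries in azStop[] */
  TokenCallback xToken;        /* Downstream callback */
  void *pCtx;                  /* Context for the downstream callback */
} StopwordCtx;

/* Return 1 if the nToken-byte token is in the stopword list, else 0. */
static int isStopword(const StopwordCtx *p, const char *pToken, int nToken){
  int i;
  for(i=0; i<p->nStop; i++){
    if( (int)strlen(p->azStop[i])==nToken
     && memcmp(p->azStop[i], pToken, (size_t)nToken)==0 ){
      return 1;
    }
  }
  return 0;
}

/* Callback handed to the wrapped ("parent") tokenizer. Stopwords are
** silently dropped; every other token is forwarded unchanged. Returns 0
** (the value of SQLITE_OK) when a token is dropped. */
static int stopwordFilterCb(void *pCtx, int tflags, const char *pToken,
                            int nToken, int iStart, int iEnd){
  StopwordCtx *p = (StopwordCtx*)pCtx;
  assert( p!=NULL && p->xToken!=NULL );
  if( isStopword(p, pToken, nToken) ) return 0;
  return p->xToken(p->pCtx, tflags, pToken, nToken, iStart, iEnd);
}

/* Example downstream callback for experimenting: counts the tokens it
** receives in the int that pCtx points to. */
static int countTokenCb(void *pCtx, int tflags, const char *pToken,
                        int nToken, int iStart, int iEnd){
  (void)tflags; (void)pToken; (void)nToken; (void)iStart; (void)iEnd;
  ++*(int*)pCtx;
  return 0;
}
```

One design point to keep in mind: as far as I understand, a token that is never reported does not occupy a position in the index, so phrase and NEAR queries would probably treat the words on either side of a dropped stopword as adjacent.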

If we had code for a stop-words implementation that seemed like it would 
work for everybody, and any licensing issues could be worked out, then 
there's no reason something like that couldn't be made part of FTS5.

Dan.





2015-09-14 Thread Dan Kennedy
On 09/14/2015 09:13 PM, Abilio Marques wrote:
> Also, I would like to mention the usefulness of some statistics to create
> more advanced ranking formulas. Things like: the Longest Common Subsequence
> between query and document, number of unique matched keywords, etc. These
> and other values are really useful in applications where bm25 is not
> suitable or enough.

Hi,

From an FTS5 custom auxiliary function, there are two ways to find the 
token offset of every phrase match in the current document:

xInstCount()/xInst() allow random access to an array of matches - 
i.e. give me the phrase number, column and token offset of the Nth match:

   https://www.sqlite.org/draft/fts5.html#xInstCount

And xPhraseFirst()/xPhraseNext() allow the user to iterate through the 
matches for a specific query phrase within the current document:

   https://www.sqlite.org/draft/fts5.html#xPhraseFirst

xPhraseFirst/xPhraseNext is faster, but xInstCount/xInst can be easier 
to use.
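
As a rough sketch of how one of the requested statistics (the number of unique matched keywords) could be built on the instance array: the code below assumes the (phrase, column, token offset) triples have already been copied into a plain array, which in a real auxiliary function would presumably be filled by calling xInst() for each index below the count reported by xInstCount(). MatchInst and countMatchedPhrases() are hypothetical names, not part of FTS5.

```c
#include <stddef.h>

/* Hypothetical record holding the three values reported for one match:
** phrase number, column number and token offset within the column. */
typedef struct MatchInst {
  int iPhrase;
  int iCol;
  int iOff;
} MatchInst;

/* Return the number of distinct query phrases, out of nPhrase, that
** have at least one entry in the aInst[] array of nInst matches. */
static int countMatchedPhrases(const MatchInst *aInst, int nInst,
                               int nPhrase){
  int nMatched = 0;
  int iPhrase;
  int i;
  for(iPhrase=0; iPhrase<nPhrase; iPhrase++){
    for(i=0; i<nInst; i++){
      if( aInst[i].iPhrase==iPhrase ){
        nMatched++;
        break;                 /* One hit is enough for this phrase */
      }
    }
  }
  return nMatched;
}
```

The nested loop is quadratic in the worst case, which should be acceptable for the handful of phrases in a typical query; a bitmask or a small flag array would avoid the rescans if that ever matters.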

It should be possible to build the sorts of things you're talking about 
on top of one of those, no? The example matchinfo() code contains code 
to determine the longest common subsequence here:

   http://www.sqlite.org/src/artifact/e96be827aa8f5?ln=259-281

Feedback from anyone who actually tries to use this API would be much 
appreciated.

Dan.







2015-09-14 Thread Stadin, Benjamin
I've implemented a custom ranker in SQLite that is similar to
SPH_RANK_SPH04 using FTS4 (BM25 + word distance and distance to beginning
of text). The only thing that wasn't possible out of the box using FTS4
was to get the distance between found matches as a word count (how many
words lie between matches). The FTS4 callback currently only provides this
distance as a byte offset, not as a word distance.

As far as I remember, there are internal data structures in FTS4 which
would allow this. But these structures aren't available to the callback.

Anyway, it would be nice if FTS5 had a feature to get the distance
between matched words expressed as word / token distance.
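
To make the requested quantity concrete, here is a minimal sketch that computes the smallest token distance between matches of two different phrases in the same column. It assumes token offsets per match are available as (phrase, column, token offset) triples, which is what Dan's reply in this thread describes xInst() as reporting. MatchInst and minTokenDistance() are hypothetical names used only for this example.

```c
#include <stddef.h>

/* Hypothetical record for one match: phrase number, column number and
** token offset within the column. */
typedef struct MatchInst {
  int iPhrase;
  int iCol;
  int iOff;
} MatchInst;

/* Return the smallest difference in token offsets between two matches
** that belong to different phrases but the same column, or -1 if there
** is no such pair. For single-token phrases, the number of words lying
** between the two matches is the returned value minus one. */
static int minTokenDistance(const MatchInst *aInst, int nInst){
  int best = -1;
  int i;
  int j;
  for(i=0; i<nInst; i++){
    for(j=i+1; j<nInst; j++){
      int d;
      if( aInst[i].iCol!=aInst[j].iCol ) continue;
      if( aInst[i].iPhrase==aInst[j].iPhrase ) continue;
      d = aInst[i].iOff - aInst[j].iOff;
      if( d<0 ) d = -d;
      if( best<0 || d<best ) best = d;
    }
  }
  return best;
}
```

A ranker could then turn this value into a proximity boost, for example by adding a term that grows as the distance shrinks; how to weight it against bm25 is left open here.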

Cheers
Ben

On 14.09.15 16:13, "sqlite-users-bounces at mailinglists.sqlite.org on
behalf of Abilio Marques" wrote:

>SPH_RANK_SPH04




2015-09-14 Thread Abilio Marques
Hi,

I know I'm a newcomer to the SQLite project, but I'm excited about what
FTS5 has to offer. To me it seems simple and powerful, and has some really
nice ideas.

Is it possible for me to contribute to the module, or is it too late for
that?

I would like to propose two new ideas. First, a customizable list of
stopwords:

https://en.wikipedia.org/wiki/Stop_words
(I didn't find anything similar to that in the documentation, am I missing
something?)

I know I can add it via a custom tokenizer, but wouldn't it be useful to
have it straight out of the box?


Also, I would like to mention the usefulness of some statistics to create
more advanced ranking formulas. Things like: the Longest Common Subsequence
between query and document, number of unique matched keywords, etc. These
and other values are really useful in applications where bm25 is not
suitable or enough.

I come from using an engine called Sphinx Search (used on huge things like
Craigslist), which offers such factors. Using them, they have defined
rankers that mix bm25 with proximity, and another they call
SPH_RANK_SPH04, which includes a weighting boost for the result appearing
at the beginning of the text field, and a bigger boost if it's an exact
match:

http://sphinxsearch.com/docs/latest/builtin-rankers.html

The formulas (in sphinx higher is better) for them are:
http://sphinxsearch.com/docs/latest/formulas-for-builtin-rankers.html

And the list of supported factors is:
http://sphinxsearch.com/docs/latest/ranking-factors.html

Of course having all of them would be overkill, but if you find them
interesting, we can get the most useful ones, allowing people to build
rankers to their own needs.


Once again, you people are the experts and know whether such ideas are
feasible and where the right place to include them is, so please tell me
your opinions.