[sqlite] FTS5 stopwords
On 09/14/2015 09:13 PM, Abilio Marques wrote:
> Hi,
>
> I know I'm a newcomer into the SQLite project, but I'm excited about what
> FTS5 has to offer. To me it seems simple and powerful, and has some really
> nice ideas.
>
> Is it possible for me to contribute on the module, or is it too late for
> that?
>
> I would like to mention two new ideas I would offer to introduce. First, a
> customizable list of stopwords:
>
> https://en.wikipedia.org/wiki/Stop_words
>
> (I didn't find anything similar to that in the documentation, am I missing
> something?)
>
> I know I can add it via a custom tokenizer, but wouldn't it be useful to
> have it straight out of the box?

Hi,

I think such a thing would be implemented using the custom tokenizer API even if it were shipped as part of FTS5. As a "wrapper tokenizer" similar to the built-in porter tokenizer perhaps.

If we had code for a stop-words implementation that seemed like it would work for everybody and any licensing issues could be worked out then there's no reason something like that couldn't be made part of FTS5.

Dan.
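To make the "wrapper tokenizer" idea concrete, here is a minimal sketch in plain Python. It does not use the FTS5 tokenizer API at all; `simple_tokenizer` and `with_stopwords` are hypothetical names chosen for illustration, and the stop-word list is just an example. The only point is the shape: one tokenizer wraps another and drops the tokens found in a configurable list.

```python
import re
from typing import Callable, Iterable, Iterator

Tokenizer = Callable[[str], Iterable[str]]


def simple_tokenizer(text: str) -> Iterator[str]:
    """Hypothetical base tokenizer: lower-cased runs of ASCII letters and digits."""
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        yield match.group(0).lower()


def with_stopwords(base: Tokenizer, stopwords: Iterable[str]) -> Tokenizer:
    """Wrap `base` so that any token found in `stopwords` is dropped."""
    blocked = frozenset(word.lower() for word in stopwords)

    def tokenize(text: str) -> Iterator[str]:
        for token in base(text):
            if token not in blocked:
                yield token

    return tokenize


tokenize = with_stopwords(simple_tokenizer, ["the", "a", "of", "is"])
print(list(tokenize("The name of the game is search")))
# ['name', 'game', 'search']
```

Whatever form a real implementation takes, the same filter has to be applied to query text as well as to indexed text, otherwise a query containing a stop word would look for a token that was never stored.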
[sqlite] FTS5 stopwords
On 09/14/2015 09:13 PM, Abilio Marques wrote:
> Hi,
>
> I know I'm a newcomer into the SQLite project, but I'm excited about what
> FTS5 has to offer. To me it seems simple and powerful, and has some really
> nice ideas.
>
> Is it possible for me to contribute on the module, or is it too late for
> that?
>
> I would like to mention two new ideas I would offer to introduce. First, a
> customizable list of stopwords:
>
> https://en.wikipedia.org/wiki/Stop_words
>
> (I didn't find anything similar to that in the documentation, am I missing
> something?)
>
> I know I can add it via a custom tokenizer, but wouldn't it be useful to
> have it straight out of the box?
>
>
> Also, I would like to mention the usefulness of some statistics to create
> more advanced ranking formulas. Things like: the Longest Common Subsequence
> between query and document, number of unique matched keywords, etc. These
> and other values are really useful in applications where bm25 is not
> suitable or enough.

Hi,

From an FTS5 custom auxiliary function, there are two ways to find the token offset of every phrase match in the current document:

The xInstCount()/xInst() pair allows random access to an array of matches - i.e. give me the phrase number, column and token offset of the Nth match:

https://www.sqlite.org/draft/fts5.html#xInstCount

And xPhraseFirst()/xPhraseNext() allow the user to iterate through the matches for a specific query phrase within the current document:

https://www.sqlite.org/draft/fts5.html#xPhraseFirst

xPhraseFirst/xPhraseNext is faster, but xInstCount/xInst can be easier to use.

It should be possible to build the sorts of things you're talking about on top of one of those, no? The example matchinfo() code contains code to determine the longest common subsequence here:

http://www.sqlite.org/src/artifact/e96be827aa8f5?ln=259-281

Feedback from anyone who actually tries to use this API is much appreciated.

Dan.
>
> I come from using an engine called Sphinx Search (used on huge things like
> Craigslist), which offers such factors. Using them, they have defined
> rankers that mix bm25 with proximity, and another they call
> SPH_RANK_SPH04, which includes a weighting boost for the result appearing
> at the beginning of the text field, and a bigger boost if it's an exact
> match:
>
> http://sphinxsearch.com/docs/latest/builtin-rankers.html
>
> The formulas (in sphinx higher is better) for them are:
> http://sphinxsearch.com/docs/latest/formulas-for-builtin-rankers.html
>
> And the list of supported factors is:
> http://sphinxsearch.com/docs/latest/ranking-factors.html.
>
> Of course having all of them would be overkill, but if you find them
> interesting, we can get the most useful ones, allowing people to build
> rankers to their own needs.
>
>
> Once again, you people are the experts and know if such ideas are feasible
> and where is the right place to include them, so please tell me your
> opinions.
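As a rough illustration of what can be built on the match list described in the reply above, the sketch below takes a list of (phrase number, column, token offset) triples, the three values xInst() is said to report for each match, and derives a few ranking inputs from them. It is plain Python operating on made-up data, not a real FTS5 auxiliary function; `Inst` and the three helper functions are hypothetical names.

```python
from itertools import combinations
from typing import List, NamedTuple, Optional


class Inst(NamedTuple):
    phrase: int  # index of the query phrase that matched
    column: int  # column in which the match occurred
    offset: int  # token offset of the match within that column


def unique_phrases(insts: List[Inst]) -> int:
    """Number of distinct query phrases that matched at least once."""
    return len({inst.phrase for inst in insts})


def min_token_distance(insts: List[Inst]) -> Optional[int]:
    """Smallest token gap between matches of two different phrases in the
    same column, or None if no such pair exists. Quadratic in the number
    of matches, which is acceptable for a sketch."""
    best = None
    for a, b in combinations(insts, 2):
        if a.phrase != b.phrase and a.column == b.column:
            gap = abs(a.offset - b.offset)
            if best is None or gap < best:
                best = gap
    return best


def first_match_offset(insts: List[Inst]) -> Optional[int]:
    """Token offset of the earliest match in any column, or None."""
    return min((inst.offset for inst in insts), default=None)


# Made-up matches for a two-phrase query against a two-column row.
insts = [Inst(0, 0, 3), Inst(1, 0, 10), Inst(0, 0, 12), Inst(1, 1, 4)]
print(unique_phrases(insts), min_token_distance(insts), first_match_offset(insts))
# 2 2 3
```

A custom ranker could combine values like these with bm25; how to weight them is left open here.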
[sqlite] FTS5 stopwords
I've implemented a custom ranker in SQLite that is similar to SPH_RANK_SPH04 using FTS4 (BM25 + word distance and distance to beginning of text).

The only thing that wasn't possible out of the box using FTS4 was to get the distance between matches as a word count (how many words lie between them). The FTS4 callback currently only exposes this distance as a byte offset, not as a word distance. As far as I remember, there are internal data structures in FTS4 which would allow this, but these structures aren't available to the callback.

Anyway, it would be nice if FTS5 had a feature to get the distance between matched words expressed as a word / token distance.

Cheers
Ben

On 14.09.15 16:13, "sqlite-users-bounces at mailinglists.sqlite.org on behalf of Abilio Marques" wrote:

>SPH_RANK_SPH04
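One way to bridge the gap described above, when only byte offsets are available, is to re-tokenize the text and map each byte offset to a token index. The sketch below does this in plain Python. It assumes UTF-8 byte offsets and uses a simple `\w+` regular expression as a stand-in tokenizer, which will not necessarily split text the same way as the tokenizer the index was built with; all function names are hypothetical.

```python
import bisect
import re
from typing import List


def token_byte_starts(text: str) -> List[int]:
    """UTF-8 byte offset at which each token of `text` starts."""
    starts = []
    for match in re.finditer(r"\w+", text):
        starts.append(len(text[:match.start()].encode("utf-8")))
    return starts


def token_index(starts: List[int], byte_offset: int) -> int:
    """Index of the token that starts at, or most recently before, byte_offset."""
    return max(bisect.bisect_right(starts, byte_offset) - 1, 0)


def token_distance(text: str, byte_a: int, byte_b: int) -> int:
    """Difference in token position between two matches given as byte offsets."""
    starts = token_byte_starts(text)
    return abs(token_index(starts, byte_a) - token_index(starts, byte_b))


# "full" starts at byte 7 (the "ï" takes two bytes), "engine" at byte 24.
print(token_distance("naïve full text search engine", 7, 24))
# 3
```

The result is the difference in token positions, so the number of words strictly between the two matches is one less.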
[sqlite] FTS5 stopwords
Hi,

I know I'm a newcomer into the SQLite project, but I'm excited about what FTS5 has to offer. To me it seems simple and powerful, and has some really nice ideas.

Is it possible for me to contribute on the module, or is it too late for that?

I would like to mention two new ideas I would offer to introduce. First, a customizable list of stopwords:

https://en.wikipedia.org/wiki/Stop_words

(I didn't find anything similar to that in the documentation, am I missing something?)

I know I can add it via a custom tokenizer, but wouldn't it be useful to have it straight out of the box?

Also, I would like to mention the usefulness of some statistics to create more advanced ranking formulas. Things like: the Longest Common Subsequence between query and document, number of unique matched keywords, etc. These and other values are really useful in applications where bm25 is not suitable or enough.

I come from using an engine called Sphinx Search (used on huge things like Craigslist), which offers such factors. Using them, they have defined rankers that mix bm25 with proximity, and another they call SPH_RANK_SPH04, which includes a weighting boost for the result appearing at the beginning of the text field, and a bigger boost if it's an exact match:

http://sphinxsearch.com/docs/latest/builtin-rankers.html

The formulas (in sphinx higher is better) for them are:
http://sphinxsearch.com/docs/latest/formulas-for-builtin-rankers.html

And the list of supported factors is:
http://sphinxsearch.com/docs/latest/ranking-factors.html.

Of course having all of them would be overkill, but if you find them interesting, we can get the most useful ones, allowing people to build rankers to their own needs.

Once again, you people are the experts and know if such ideas are feasible and where is the right place to include them, so please tell me your opinions.
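For reference, the Longest Common Subsequence statistic mentioned above can be computed over token lists with the classic dynamic-programming recurrence. The sketch below is plain Python and is not taken from either SQLite or Sphinx; whether Sphinx's own LCS factor uses exactly this definition is not something this sketch assumes. `lcs_length` is a hypothetical name.

```python
from typing import Sequence


def lcs_length(query_tokens: Sequence[str], doc_tokens: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token lists.

    Tokens must appear in the same order in both lists but need not be
    adjacent. Uses one row of the DP table at a time.
    """
    prev = [0] * (len(doc_tokens) + 1)
    for q in query_tokens:
        curr = [0]
        for j, d in enumerate(doc_tokens, start=1):
            if q == d:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(lcs_length("quick brown fox".split(), "the quick red fox jumps".split()))
# 2
```

Here "quick" and "fox" appear in both lists in the same order, so the result is 2.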