subject:"Re\: \[sqlite\] Proximity ranking with FTS"

Re: [sqlite] Proximity ranking with FTS

2014-06-17 Thread Josh Wilson

Let's say I have search phrase "TermA TermB"

matchinfo option 'p' would be 2. 

CREATE VIRTUAL TABLE t1 USING fts4(title, content);

matchinfo option 'c' returns 2 for the number of columns. 

Now consider sample data:

|1|""|"TermA"|
|2|"TermA TermB"|"TermA TermA"|
|3|"TermA TermA TermA"|"TermB"|

matchinfo option 'x' would have ('p' * 'c' * 3) bytes of data per row. 

But each of these are aggregate pieces of information. 

For example in the list of (p)hrase terms I'm interested in the one at index 
[0] :: "TermA" in (c)olumn [0] for when matchinfo is looking at row [3]; I'm 
going to need a list of 3 token positions. But if there were more matches I'm 
going to need N token positions. 

So if matchinfo had a 't' option which is the total number of token hits within 
the row and this is an int then we can have option 'q' which would have the 
following data:

int N = matchinfo[T_OFFSET];
for (int i = 0; i < N; i++)
{
// this is composed on my phone so pardon the poor indenting. 
int phraseTerm = matchinfo[Q_OFFSET + 3*i];
int column = matchinfo[Q_OFFSET + 3*i + 1];
int tokenPosition = matchinfo[Q_OFFSET + 3*i + 2];
}

Again ideally this would be precomputed so matchinfo can maintain its speed in 
forming the BLOB. 

This is similar to how offsets() returns results but the documentation says 
that offsets() is an order of magnitude slower and I'm presuming it is using 
the fts3tokenize() on the matched results and tokenising the data again. 

A quick win would be to make a token_offsets() function that uses the 
fts3tokenise() function to get the values we are after by tokenising the 
results. Technically it'd get the job done but I'd like it to still have the 
speed matchinfo has so the proximity ranking isn't waiting on tokenising 
documents all the time. 

So if it is to be precalculated it will have to be stored in a shadow table 
somewhere and also updated accordingly with FTS4 INSERT, UPDATE and DELETE 
actions. 

Regards

Josh



--
View this message in context: 
http://sqlite.1065341.n5.nabble.com/Proximity-ranking-with-FTS-tp76149p76156.html
Sent from the SQLite mailing list archive at Nabble.com.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Proximity ranking with FTS

2014-06-17 Thread Stadin, Benjamin

This info is however not provided to an extension function. It would be
great if the token number ¹t¹ could be added.

Ben

Am 17.06.14 19:36 schrieb "Dan Kennedy" unter :

>On 06/17/2014 10:48 AM, Josh Wilson wrote:
>> Yeah I had thought about using the byte distance between words but you
>>get
>> these instances:
>>
>> [Example A]
>> |word1|10charword|word2|
>>
>> [Example B]
>> |word1|3charword|4charword|3charword|word2|
>>
>> By using byte distances, both of these score the same, where Example A
>> should score more highly.
>>
>> But it would seem I can use the fts3_tokenizer somehow to get the token
>> positions or that this underlying value is available but just not
>>stored in
>> an accessible manner.
>
>I think it's possible to do. When it visits a row as part of a full-text
>search, internally FTS has a list of matches within the current row for
>each phrase in the query. Each match is stored as a column and token
>offset - the number of tokens that precede the match within the column
>text.
>
>Is that what you need? Do you have any ideas for an fts4 interface it?
>
>Dan.
>
>
>
>
>>
>> I implemented OkapiBM25f [1] but was hoping to implement something like
>>the
>> following proximity ranking [2] as it combines Bag-Of-Words ranking and
>> proximity ranking. Although that article proposes to precalculate the
>> distance pairs for all tokens, I'm happy to accept the TimeCost and
>> calculate on the fly as that SpaceCost won't be worth it.
>>
>> [1] https://github.com/neozenith/sqlite-okapi-bm25
>> [2] http://infolab.stanford.edu/~theobald/pub/proximity-spire07.pdf
>>
>>
>>
>> --
>> View this message in context:
>>http://sqlite.1065341.n5.nabble.com/Proximity-ranking-with-FTS-tp76149p76
>>152.html
>> Sent from the SQLite mailing list archive at Nabble.com.
>> ___
>> sqlite-users mailing list
>> sqlite-users@sqlite.org
>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
>___
>sqlite-users mailing list
>sqlite-users@sqlite.org
>http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Proximity ranking with FTS

2014-06-17 Thread Dan Kennedy


On 06/17/2014 10:48 AM, Josh Wilson wrote:

Yeah I had thought about using the byte distance between words but you get
these instances:

[Example A]
|word1|10charword|word2|

[Example B]
|word1|3charword|4charword|3charword|word2|

By using byte distances, both of these score the same, where Example A
should score more highly.

But it would seem I can use the fts3_tokenizer somehow to get the token
positions or that this underlying value is available but just not stored in
an accessible manner.


I think it's possible to do. When it visits a row as part of a full-text 
search, internally FTS has a list of matches within the current row for 
each phrase in the query. Each match is stored as a column and token 
offset - the number of tokens that precede the match within the column text.


Is that what you need? Do you have any ideas for an fts4 interface it?

Dan.






I implemented OkapiBM25f [1] but was hoping to implement something like the
following proximity ranking [2] as it combines Bag-Of-Words ranking and
proximity ranking. Although that article proposes to precalculate the
distance pairs for all tokens, I'm happy to accept the TimeCost and
calculate on the fly as that SpaceCost won't be worth it.

[1] https://github.com/neozenith/sqlite-okapi-bm25
[2] http://infolab.stanford.edu/~theobald/pub/proximity-spire07.pdf



--
View this message in context: 
http://sqlite.1065341.n5.nabble.com/Proximity-ranking-with-FTS-tp76149p76152.html
Sent from the SQLite mailing list archive at Nabble.com.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Proximity ranking with FTS

2014-06-16 Thread Josh Wilson

Yeah I had thought about using the byte distance between words but you get
these instances:

[Example A]
|word1|10charword|word2|

[Example B]
|word1|3charword|4charword|3charword|word2|

By using byte distances, both of these score the same, where Example A
should score more highly.

But it would seem I can use the fts3_tokenizer somehow to get the token
positions or that this underlying value is available but just not stored in
an accessible manner.

I implemented OkapiBM25f [1] but was hoping to implement something like the
following proximity ranking [2] as it combines Bag-Of-Words ranking and
proximity ranking. Although that article proposes to precalculate the
distance pairs for all tokens, I'm happy to accept the TimeCost and
calculate on the fly as that SpaceCost won't be worth it.

[1] https://github.com/neozenith/sqlite-okapi-bm25
[2] http://infolab.stanford.edu/~theobald/pub/proximity-spire07.pdf



--
View this message in context: 
http://sqlite.1065341.n5.nabble.com/Proximity-ranking-with-FTS-tp76149p76152.html
Sent from the SQLite mailing list archive at Nabble.com.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Proximity ranking with FTS

2014-06-16 Thread Stadin, Benjamin

This info is not provided by the API (or I missed it). I required this for
a ranking that considers both the distance of N matches to each other, and
the offset of those matches to the beginning of a text. I worked around
this by giving a boost for the byte distance of the words (taking into
account the length in bytes of a phrase, phrases with less byte distance
to each other as provided in the order of their appearance of the match
statement get a slightly higher boost).

This worked Ok as approximation. Depending how your ranking needs to be
calculated, you¹d however maybe better use a custom extra fts info table
and fill this info yourself.


Am 17.06.14 02:58 schrieb "Josh Wilson" unter :

>Ok so I have  found the following guides:
>
>http://www.sqlite.org/privatebranch.html
>
>http://www.sqlite.org/src/doc/trunk/README.md
>
>So as far as creating a private branch 'the proper way' this should be
>sufficient but as far as getting the token positions for FTS MATCH results
>any advice on the topic is appreciated.
>
>
>
>--
>View this message in context:
>http://sqlite.1065341.n5.nabble.com/Proximity-ranking-with-FTS-tp76149p761
>50.html
>Sent from the SQLite mailing list archive at Nabble.com.
>___
>sqlite-users mailing list
>sqlite-users@sqlite.org
>http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Proximity ranking with FTS

2014-06-16 Thread Josh Wilson

Ok so I have  found the following guides:

http://www.sqlite.org/privatebranch.html

http://www.sqlite.org/src/doc/trunk/README.md

So as far as creating a private branch 'the proper way' this should be
sufficient but as far as getting the token positions for FTS MATCH results
any advice on the topic is appreciated.



--
View this message in context: 
http://sqlite.1065341.n5.nabble.com/Proximity-ranking-with-FTS-tp76149p76150.html
Sent from the SQLite mailing list archive at Nabble.com.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Proximity ranking with FTS

Re: [sqlite] Proximity ranking with FTS

Re: [sqlite] Proximity ranking with FTS

Re: [sqlite] Proximity ranking with FTS

Re: [sqlite] Proximity ranking with FTS

Re: [sqlite] Proximity ranking with FTS

6 matches

Site Navigation

Mail list logo

Footer information