[sqlite] Constraining FTS5 results based on offsets()

2018-04-08 Thread miroslav.marango...@outlook.com
Hi,
For my use case I need to index some texts, and I also need be able to assign 
attributes to sub-sequences/spans of tokens in the texts. I want to be able to 
search only for keywords/phrases that occur in spans with a certain attribute. 
As an example, imagine we have a set of rich text documents and we want to find 
the locations of the "SQLite rocks" phrase in that set, but not just any 
instance of it- only those that have an attribute "bold".

The general idea, I'm considering at the moment is as follows:
1) create a table 'spans' with columns (doc_id, attrib_id, start_tok, end_tok), 
where doc_id is equal to the corresponding rowid in the FTS table, and 
start_tok and end_tok are the 0-based offsets that delimit a span
2) issue a match (sub?)query against the FTS table and obtain a list of (rowid, 
list-of-offsets in that doc)
3) somehow convert the above results into (rowid, start_offset, end_offset) for 
each entry of the list-of-offsets
4) join the results from 3) with the "spans" table on "rowid=spans.doc_id" 
where "spans.attrib_id=X and start_offset >= spans.start_pos and end_offset <= 
spans.end_pos"

Do you think this general approach makes sense, and how would you approach the 
problem if not? I only have a cursory knowledge of both SQL and SQLite at this 
point, so it's quite possible I'm missing something obvious.

On the implementation side:
- as far as I understand FTS5 has some clear advantages over FTS3/4. Apparently 
there isn't readily available offsets() function in FTS5 yet, but the API 
should make writing one rather straightforward

- step 3) is what I wonder about the most... What would be a good way to 
convert a (doc_id, ) row into (doc_id, start_offset, 
end_offset) tuples for every entry in the list? I'd guess I will have to 
implement some sort of virtual table(a.k.a. "table-valued function")? Perhaps a 
table-valued function that takes a FTS query as its parameter that it then uses 
to obtain (doc_id, ), and expose each hit as (doc_id, 
start_offset, end_offset) rows in the virtual table?

Any thoughts and ideas would be greatly appreciated.

Thank you!
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Constraining FTS5 results based on offsets()

2018-04-10 Thread Dan Kennedy

On 04/08/2018 02:00 PM, miroslav.marango...@outlook.com wrote:

Hi,
For my use case I need to index some texts, and I also need be able to assign attributes to 
sub-sequences/spans of tokens in the texts. I want to be able to search only for keywords/phrases 
that occur in spans with a certain attribute. As an example, imagine we have a set of rich text 
documents and we want to find the locations of the "SQLite rocks" phrase in that set, but 
not just any instance of it- only those that have an attribute "bold".

The general idea, I'm considering at the moment is as follows:
1) create a table 'spans' with columns (doc_id, attrib_id, start_tok, end_tok), 
where doc_id is equal to the corresponding rowid in the FTS table, and 
start_tok and end_tok are the 0-based offsets that delimit a span
2) issue a match (sub?)query against the FTS table and obtain a list of (rowid, 
list-of-offsets in that doc)
3) somehow convert the above results into (rowid, start_offset, end_offset) for 
each entry of the list-of-offsets
4) join the results from 3) with the "spans" table on "rowid=spans.doc_id" where 
"spans.attrib_id=X and start_offset >= spans.start_pos and end_offset <= spans.end_pos"

Do you think this general approach makes sense, and how would you approach the 
problem if not? I only have a cursory knowledge of both SQL and SQLite at this 
point, so it's quite possible I'm missing something obvious.

On the implementation side:
- as far as I understand FTS5 has some clear advantages over FTS3/4. Apparently 
there isn't readily available offsets() function in FTS5 yet, but the API 
should make writing one rather straightforward

- step 3) is what I wonder about the most... What would be a good way to convert a (doc_id, 
) row into (doc_id, start_offset, end_offset) tuples for every entry in the 
list? I'd guess I will have to implement some sort of virtual table(a.k.a. "table-valued 
function")? Perhaps a table-valued function that takes a FTS query as its parameter that it then 
uses to obtain (doc_id, ), and expose each hit as (doc_id, start_offset, 
end_offset) rows in the virtual table?

Any thoughts and ideas would be greatly appreciated.


Maybe just implement an FTS5 function to do the filtering in step 4 
above. So that your query looks like:


  SELECT * FROM fts5tbl('SQLite+rocks'), spans
  WHERE ft.rowid=spans.doc_id
AND spans.attrib_id = 'bold'
AND custom_fts5_function(fts5tbl, spans.start_tok, spans.end_tok);

where custom_fts5_function() returns true if there are any phrase hits 
with token offsets between its two trailing arguments. Or whatever it is 
you require.


  https://sqlite.org/fts5.html#custom_auxiliary_functions

Dan.







Thank you!
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Constraining FTS5 results based on offsets()

2018-04-10 Thread Miroslav Marangozov
Thank you Dan!
This should be much easier to implement than a full blown virtual table, but as 
far as I can tell(as I've mentioned, I'm not an expert in this) it doesn't do 
exactly what I want. I need all the locations of the search phrases, subject to 
the attribute constraint, in a document, but this query will only return the 
spans that contain one or more instance of the phrase. Or maybe I'm missing 
something?

Provided my understanding is correct, what do you think would be most the 
pragmatic way to change the query, so that it returns the locations of the 
phrases? What I imagine could work, even though it's admittedly ugly, is to 
write another UDF, let's call it "hacky_offsets()", that has some sort of 
"private" communication channel with custom_fts5_function(). So the query will 
be something like:

   SELECT fts5tbl.rowid, hacky_offsets()
   FROM fts5tbl('SQLite+rocks'), spans
   WHERE ft5tbl.rowid=spans.doc_id
AND spans.attrib_id = 'bold'
AND custom_fts5_function(fts5tbl, spans.start_tok, spans.end_tok);

When custom_fts5_function() finds instances of the phrases that fall between 
start_tok/end_tok it records these instances into a location hacky_offsets() 
knows about, so it can read them from there and return them to the application.
Would this even work and can you think of a more idiomatic/elegant 
implementation?

Thanks!



From: sqlite-users  on behalf of 
Dan Kennedy 
Sent: Tuesday, April 10, 2018 11:49 AM
To: sqlite-users@mailinglists.sqlite.org
Subject: Re: [sqlite] Constraining FTS5 results based on offsets()

On 04/08/2018 02:00 PM, miroslav.marango...@outlook.com wrote:
> Hi,
> For my use case I need to index some texts, and I also need be able to assign 
> attributes to sub-sequences/spans of tokens in the texts. I want to be able 
> to search only for keywords/phrases that occur in spans with a certain 
> attribute. As an example, imagine we have a set of rich text documents and we 
> want to find the locations of the "SQLite rocks" phrase in that set, but not 
> just any instance of it- only those that have an attribute "bold".
>
> The general idea, I'm considering at the moment is as follows:
> 1) create a table 'spans' with columns (doc_id, attrib_id, start_tok, 
> end_tok), where doc_id is equal to the corresponding rowid in the FTS table, 
> and start_tok and end_tok are the 0-based offsets that delimit a span
> 2) issue a match (sub?)query against the FTS table and obtain a list of 
> (rowid, list-of-offsets in that doc)
> 3) somehow convert the above results into (rowid, start_offset, end_offset) 
> for each entry of the list-of-offsets
> 4) join the results from 3) with the "spans" table on "rowid=spans.doc_id" 
> where "spans.attrib_id=X and start_offset >= spans.start_pos and end_offset 
> <= spans.end_pos"
>
> Do you think this general approach makes sense, and how would you approach 
> the problem if not? I only have a cursory knowledge of both SQL and SQLite at 
> this point, so it's quite possible I'm missing something obvious.
>
> On the implementation side:
> - as far as I understand FTS5 has some clear advantages over FTS3/4. 
> Apparently there isn't readily available offsets() function in FTS5 yet, but 
> the API should make writing one rather straightforward
>
> - step 3) is what I wonder about the most... What would be a good way to 
> convert a (doc_id, ) row into (doc_id, start_offset, 
> end_offset) tuples for every entry in the list? I'd guess I will have to 
> implement some sort of virtual table(a.k.a. "table-valued function")? Perhaps 
> a table-valued function that takes a FTS query as its parameter that it then 
> uses to obtain (doc_id, ), and expose each hit as (doc_id, 
> start_offset, end_offset) rows in the virtual table?
>
> Any thoughts and ideas would be greatly appreciated.

Maybe just implement an FTS5 function to do the filtering in step 4
above. So that your query looks like:

   SELECT * FROM fts5tbl('SQLite+rocks'), spans
   WHERE ft.rowid=spans.doc_id
 AND spans.attrib_id = 'bold'
 AND custom_fts5_function(fts5tbl, spans.start_tok, spans.end_tok);

where custom_fts5_function() returns true if there are any phrase hits
with token offsets between its two trailing arguments. Or whatever it is
you require.

   https://sqlite.org/fts5.html#custom_auxiliary_functions

Dan.





>
> Thank you!
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users