Hi,
For my use case I need to index some texts, and I also need be able to assign 
attributes to sub-sequences/spans of tokens in the texts. I want to be able to 
search only for keywords/phrases that occur in spans with a certain attribute. 
As an example, imagine we have a set of rich text documents and we want to find 
the locations of the "SQLite rocks" phrase in that set, but not just any 
instance of it- only those that have an attribute "bold".

The general idea, I'm considering at the moment is as follows:
1) create a table 'spans' with columns (doc_id, attrib_id, start_tok, end_tok), 
where doc_id is equal to the corresponding rowid in the FTS table, and 
start_tok and end_tok are the 0-based offsets that delimit a span
2) issue a match (sub?)query against the FTS table and obtain a list of (rowid, 
list-of-offsets in that doc)
3) somehow convert the above results into (rowid, start_offset, end_offset) for 
each entry of the list-of-offsets
4) join the results from 3) with the "spans" table on "rowid=spans.doc_id" 
where "spans.attrib_id=X and start_offset >= spans.start_pos and end_offset <= 
spans.end_pos"

Do you think this general approach makes sense, and how would you approach the 
problem if not? I only have a cursory knowledge of both SQL and SQLite at this 
point, so it's quite possible I'm missing something obvious.

On the implementation side:
- as far as I understand FTS5 has some clear advantages over FTS3/4. Apparently 
there isn't readily available offsets() function in FTS5 yet, but the API 
should make writing one rather straightforward

- step 3) is what I wonder about the most... What would be a good way to 
convert a (doc_id, <list-of-offsets>) row into (doc_id, start_offset, 
end_offset) tuples for every entry in the list? I'd guess I will have to 
implement some sort of virtual table(a.k.a. "table-valued function")? Perhaps a 
table-valued function that takes a FTS query as its parameter that it then uses 
to obtain (doc_id, <list-of-offsets>), and expose each hit as (doc_id, 
start_offset, end_offset) rows in the virtual table?

Any thoughts and ideas would be greatly appreciated.

Thank you!
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to