We use SQLite for indexing and searching the text contents of our app using
FTS4 (about 27k unique words, about 1 million words for the whole contents). In
particular, we use the offsets function. Currently, after some testing, we?re
experiencing a plenty of problems with finding the results needed. For the
forms of the words searching we use the ?all-to-all? way, which means we have
some standard English endings and words forms and modify the initial request so
that all possible combinations are included and separated by OR from each
other. I. e. if we have two forms and two words in the request (for
simplicity), that would look like (the MATCH part): ?(word1_form1 NEAR/10
word2_form1) OR (word1_form1 NEAR/10 word2_form2) OR (word1_form2 NEAR/10
word2_form1) OR (word1_form2 NEAR/10 word2_form2)?. Initially, the problem
appeared that the query returned offsets for absolutely wrong words. While
searching for ?honest fair?, we got words like ?good?, ?sport?, ?natural?
offsets. We moved from the system provided (iOS, OS X) SQLite (version 3.8.5,
further called ?old SQLite?) to the latest one available on the official site
(version 3.8.10.2, further called ?new SQLite? or just ?SQLite?), which solved
the issue, but returned offsets for words which were not near (according to
NEAR/10). I supposed it?s because the request was too big because of lots of
the words forms. But, for example, this request for ?offense is rank?: select
documents.name, contents.rowid, part, offsets(contents) from contents left join
documents on contents.document_id = documents.id where text match '(\"offense\"
NEAR/10 \"is\" NEAR/10 \"rank\")' order by document_id returned the correct
results, while this one returned false positive results (0 stands for
apostrophe so it?s not interpreted as a separate token for the simple tokenizer
we use): select documents.name, contents.rowid, part, offsets(contents) from
contents left join documents on contents.document_id = documents.id where text
match '(\"offense\" NEAR/10 \"is\" NEAR/10 \"rank\") OR (\"offense\" NEAR/10
\"is\" NEAR/10 \"rank0s\")' order by document_id The experiments revealed that
?rank0s? can be whatever: ranks", "rankqwerty" and so on, even
"rankqwertyvrfbgbrevwkefovmwpsrvrm" or "yugbuoiipkipnuo?. If it's removed,
nothing found. If after that the first word is modified, the one correct result
found. Also, a search for ?speak again? didn?t find the results at all, though
there should be quite a lot of them. The ?database disk image is malformed?
error was reported, but integrity check completed with no errors, the database
was recreated and that didn?t help, meanwhile the old SQLite did find the
results for this request in exactly that file with no error reported. Also, the
new SQLite worked well for the same request if the offsets function was
replaced with the snippet function (that also solved the false positive results
problem described above). The search for ?father? returned not all results. For
example, the first result which can manually be found in the contents table
using SQLite Database Browser for this request: select documents.name,
contents.rowid, part, offsets(contents) from contents left join documents on
contents.document_id = documents.id where text match '(\"father\") OR
(\"fathere\") OR (\"fathering\") OR (\"fatherish\") OR (\"fathers\") OR
(\"fatheres\") OR (\"fatherian\") OR (\"fatheral\") OR (\"father0s\")' and
document_id in (25) order by document_id missed when actually searched, the
snippet function doesn?t help here, neither helps searching for a single word
form. Error logging with sqlite3_config(SQLITE_CONFIG_LOG, errorLogCallback,
NULL) was never called.
Any ideas?