On 07/02/2015 09:24 PM, Kevin Benson wrote: > FTFY, you're welcome ;-) > Now *MAYBE* someone can read it and possibly reply. > > -- > -- > -- > --???-- > K e V i N > > On Thu, Jul 2, 2015 at 9:11 AM, <shuricksoft at ukr.net> wrote: > >> We use SQLite for indexing and searching the text contents of our app >> using FTS4 >> (about 27k unique words, about 1 million words for the whole contents). In >> particular, >> we use the offsets function. Currently, after some testing, we?re >> experiencing a plenty >> of problems with finding the results needed. >> >> For the forms of the words searching we use the ?all-to-all? way, which >> means we have >> some standard English endings and words forms and modify the initial >> request so that >> all possible combinations are included and separated by OR from each >> other. >> >> I. e. if we have two forms and two words in the request (for simplicity), >> that would look >> like (the MATCH part): >> >> ?(word1_form1 NEAR/10 word2_form1) >> OR (word1_form1 NEAR/10 word2_form2) >> OR (word1_form2 NEAR/10 word2_form1) >> OR (word1_form2 NEAR/10 word2_form2)?. >> >> Initially, the problem appeared that the query returned offsets for >> absolutely wrong words. >> While searching for ?honest fair?, we got words like ?good?, ?sport?, >> ?natural? offsets. >> We moved from the system provided (iOS, OS X) SQLite (version 3.8.5, >> further called >> ?old SQLite?) to the latest one available on the official site (version >> 3.8.10.2, >> further called ?new SQLite? or just ?SQLite?), which solved the issue, but >> returned offsets >> for words which were not near (according to NEAR/10). I supposed it?s >> because the request >> was too big because of lots of the words forms. But, for example, this >> request >> >> for ?offense is rank?: >> select documents.name, contents.rowid, part, offsets(contents) >> from contents left join documents on contents.document_id = documents.id >> where text match '(\"offense\" NEAR/10 \"is\" NEAR/10 \"rank\")' >> order by document_id >> >> returned the correct results, while this one returned false positive >> results (0 stands for >> apostrophe so it?s not interpreted as a separate token for the simple >> tokenizer we use): >> >> select documents.name, contents.rowid, part, offsets(contents) >> from contents left join documents on contents.document_id = documents.id >> where text match '(\"offense\" NEAR/10 \"is\" NEAR/10 \"rank\") >> OR (\"offense\" NEAR/10 \"is\" NEAR/10 \"rank0s\")' >> order by document_id >> >> The experiments revealed that ?rank0s? can be whatever: ranks", >> "rankqwerty" and so on, >> even "rankqwertyvrfbgbrevwkefovmwpsrvrm" or "yugbuoiipkipnuo?. If it's >> removed, nothing >> found. If after that the first word is modified, the one correct result >> found. Also, a >> search for ?speak again? didn?t find the results at all, though there >> should be quite a >> lot of them. >> >> The ?database disk image is malformed? error was reported, but integrity >> check completed >> with no errors, the database was recreated and that didn?t help, meanwhile >> the old SQLite >> did find the results for this request in exactly that file with no error >> reported. >> >> Also, the new SQLite worked well for the same request if the offsets >> function was replaced >> with the snippet function (that also solved the false positive results >> problem described >> above). The search for ?father? returned not all results. For example, the >> first result >> which can manually be found in the contents table using SQLite Database >> Browser for this >> request: >> >> select documents.name, contents.rowid, part, offsets(contents) >> from contents left join documents on contents.document_id = documents.id >> where text match '(\"father\") >> OR (\"fathere\") OR (\"fathering\") >> OR (\"fatherish\") OR (\"fathers\") >> OR (\"fatheres\") OR (\"fatherian\") >> OR (\"fatheral\") OR (\"father0s\")' >> and document_id in (25) >> order by document_id >> >> missed when actually searched, the snippet function doesn?t help here, >> neither helps >> searching for a single word form. Error logging with >> >> sqlite3_config(SQLITE_CONFIG_LOG, errorLogCallback, NULL) >> >> was never called. >> Any ideas?
It's difficult to say. The level of detail you have provided is a little overwhelming. If you are using a custom tokenizer and it behaves inconsistently you could get these kind of problems. Or it could also be a bug in the combination of NEAR, OR and an auxiliary FTS function. Are you able to post a database online along with a query that returns an incorrect result? And the tokenizer implementation too if possible, although it might be possible to figure it out without that. Does "neither helps searching for a single word form" mean that a simple query like "text MATCH 'father'" is also failing? If so, that's the one to post. You are compiling with SQLITE_ENABLE_FTS3_PARENTHESIS defined, correct? Dan.