On 07/02/2015 09:24 PM, Kevin Benson wrote:
> FTFY, you're welcome ;-)
> Now *MAYBE* someone can read it and possibly reply.
>
> --
>     --
>        --
>           --???--
>          K e V i N
>
> On Thu, Jul 2, 2015 at 9:11 AM, <shuricksoft at ukr.net> wrote:
>
>> We use SQLite for indexing and searching the text contents of our app
>> using FTS4
>> (about 27k unique words, about 1 million words for the whole contents). In
>> particular,
>> we use the offsets function. Currently, after some testing, we?re
>> experiencing a plenty
>> of problems with finding the results needed.
>>
>> For the forms of the words searching we use the ?all-to-all? way, which
>> means we have
>> some standard English endings and words forms and modify the initial
>> request so that
>> all possible combinations are included and separated by OR from each
>> other.
>>
>> I. e. if we have two forms and two words in the request (for simplicity),
>> that would look
>> like (the MATCH part):
>>
>> ?(word1_form1 NEAR/10 word2_form1)
>> OR (word1_form1 NEAR/10 word2_form2)
>> OR (word1_form2 NEAR/10 word2_form1)
>> OR (word1_form2 NEAR/10 word2_form2)?.
>>
>> Initially, the problem appeared that the query returned offsets for
>> absolutely wrong words.
>> While searching for ?honest fair?, we got words like ?good?, ?sport?,
>> ?natural? offsets.
>> We moved from the system provided (iOS, OS X) SQLite (version 3.8.5,
>> further called
>> ?old SQLite?) to the latest one available on the official site (version
>> 3.8.10.2,
>> further called ?new SQLite? or just ?SQLite?), which solved the issue, but
>> returned offsets
>> for words which were not near (according to NEAR/10). I supposed it?s
>> because the request
>> was too big because of lots of the words forms. But, for example, this
>> request
>>
>> for ?offense is rank?:
>> select documents.name, contents.rowid, part, offsets(contents)
>> from contents left join documents on contents.document_id = documents.id
>> where text match '(\"offense\" NEAR/10 \"is\" NEAR/10 \"rank\")'
>> order by document_id
>>
>> returned the correct results, while this one returned false positive
>> results (0 stands for
>> apostrophe so it?s not interpreted as a separate token for the simple
>> tokenizer we use):
>>
>> select documents.name, contents.rowid, part, offsets(contents)
>> from contents left join documents on contents.document_id = documents.id
>> where text match '(\"offense\" NEAR/10 \"is\" NEAR/10 \"rank\")
>> OR (\"offense\" NEAR/10 \"is\" NEAR/10 \"rank0s\")'
>> order by document_id
>>
>> The experiments revealed that ?rank0s? can be whatever: ranks",
>> "rankqwerty" and so on,
>> even "rankqwertyvrfbgbrevwkefovmwpsrvrm" or "yugbuoiipkipnuo?. If it's
>> removed, nothing
>> found. If after that the first word is modified, the one correct result
>> found. Also, a
>> search for ?speak again? didn?t find the results at all, though there
>> should be quite a
>> lot of them.
>>
>> The ?database disk image is malformed? error was reported, but integrity
>> check completed
>> with no errors, the database was recreated and that didn?t help, meanwhile
>> the old SQLite
>> did find the results for this request in exactly that file with no error
>> reported.
>>
>> Also, the new SQLite worked well for the same request if the offsets
>> function was replaced
>> with the snippet function (that also solved the false positive results
>> problem described
>> above). The search for ?father? returned not all results. For example, the
>> first result
>> which can manually be found in the contents table using SQLite Database
>> Browser for this
>> request:
>>
>> select documents.name, contents.rowid, part, offsets(contents)
>> from contents left join documents on contents.document_id = documents.id
>> where text match '(\"father\")
>> OR (\"fathere\") OR (\"fathering\")
>> OR (\"fatherish\") OR (\"fathers\")
>> OR (\"fatheres\") OR (\"fatherian\")
>> OR (\"fatheral\") OR (\"father0s\")'
>> and document_id in (25)
>> order by document_id
>>
>> missed when actually searched, the snippet function doesn?t help here,
>> neither helps
>> searching for a single word form. Error logging with
>>
>> sqlite3_config(SQLITE_CONFIG_LOG, errorLogCallback, NULL)
>>
>> was never called.
>>   Any ideas?


It's difficult to say. The level of detail you have provided is a little 
overwhelming.

If you are using a custom tokenizer and it behaves inconsistently you 
could get these kind of problems. Or it could also be a bug in the 
combination of NEAR, OR and an auxiliary FTS function.

Are you able to post a database online along with a query that returns 
an incorrect result? And the tokenizer implementation too if possible, 
although it might be possible to figure it out without that.

Does "neither helps searching for a single word form" mean that a simple 
query like "text MATCH 'father'" is also failing? If so, that's the one 
to post.

You are compiling with SQLITE_ENABLE_FTS3_PARENTHESIS defined, correct?

Dan.




Reply via email to