On 2016-01-07 19:31, Mario M. Westphal wrote:
> I hence wonder if this problem has been tackled already and if there is a
> "standard" solution. 

If I understand you correctly, you are looking for a compound splitting
(decompounding) algorithm. Unfortunately there is no "standard solution"
for this. Usable compound splitting algorithms exist for some of the
world's many languages, and there are also attempts to create universal,
statistical algorithms.

As you said, for English a simple sub-string search might suffice, but
for other languages it is more complex. I assume that you speak German.
If you have a document that contains the term "Verkehrsleitsystem" and
your search query is "Verkehr leiten", it's reasonable to assume that
the document is relevant to the search query. Unfortunately a sub-string
search would not find the document: the query as a whole does not occur
in the compound, and "leiten" appears there only as the stem "leit".
Other languages are even more difficult (a textbook on linguistics will
explain this better than I can).
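
To make the "Verkehrsleitsystem" example concrete, here is a minimal
Python sketch of a dictionary-based splitter. Everything in it is an
assumption for illustration: the three-word LEXICON, the list of linking
elements, and the crude_stem() helper are hypothetical stand-ins for a
real lexicon and a real stemmer, and the code has nothing to do with how
SQLite FTS5 tokenizes text.

```python
# Toy lexicon and linking elements: hypothetical, for illustration only.
LEXICON = {"verkehr", "leit", "system"}
LINKERS = ("", "s", "es", "n", "en")


def split_compound(word, lexicon=LEXICON):
    """Return lexicon parts covering `word`, or None if no split is found."""
    word = word.lower()
    if not word:
        return []
    # Try the longest known prefix first, then recurse on the remainder.
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head not in lexicon:
            continue
        rest = word[end:]
        # Allow an optional linking element between two parts.
        for linker in LINKERS:
            if rest.startswith(linker):
                tail = split_compound(rest[len(linker):], lexicon)
                if tail is not None:
                    return [head] + tail
    return None


def crude_stem(term):
    """Strip a trailing 'en'; a stand-in for a real German stemmer."""
    term = term.lower()
    return term[:-2] if term.endswith("en") and len(term) > 4 else term


def matches(query, document_word):
    """True if every stemmed query term is a part of the split compound."""
    parts = split_compound(document_word) or [document_word.lower()]
    return all(crude_stem(term) in parts for term in query.split())


if __name__ == "__main__":
    print("verkehr leiten" in "verkehrsleitsystem")
    print(split_compound("Verkehrsleitsystem"))
    print(matches("Verkehr leiten", "Verkehrsleitsystem"))
```

With this toy lexicon the plain sub-string test fails, while the split
yields the parts "verkehr", "leit" and "system", so the stemmed query
terms match. A real lexicon makes splits ambiguous, and this sketch
simply returns the first one it finds.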

Even if you have such an algorithm, it's not trivial to score the
results, and there are more aspects to consider when building even a
simple search algorithm. For example, in English you will also have to
analyse the phrase structure to identify open compounds such as
"ice cream".

It might help if you mentioned the languages you are interested in and
the application you have in mind, so that we can evaluate whether SQLite
FTS5 could meet your requirements.
