FTS indexing is something I hear quite often nowadays. I’ve added some hacks to 
make it work better for some installations, but it’s about time to think about 
the whole design and how it could be improved for everyone in future. Here are 
some of my initial thoughts.

Currently Dovecot supports 3 full text search engines: Solr, CLucene and 
Dovecot Squat. CLucene plugin has various features built in, which should have 
been built in a generic way to work with all the engines (although Solr has 
most of those already built-in). Squat was abandoned a few years ago in favor 
of Solr/CLucene, but perhaps it could be brought back to life, since it looks 
like its index sizes could be smaller than Lucene's.

Here's a list of things that should be added to generic Dovecot FTS code to 
improve all the backends:

1. Support for multiple languages. Use textcat while indexing to guess the 
language of the indexed data. (Perhaps run it separately for each paragraph to 
handle multi-language mails? Or at least many emails begin/end with different 
language than the text in the middle, e.g. "Foo Bar wrote:" is often in various 
languages.) Index the data using the detected language's stemming and other 
features. Keep track of which languages have been used in the index, and when 
searching stem the search words to all the used languages. Since each added 
language requires additional searches and there's the possibility of wrong 
detection, the list of allowed languages could be configurable. See also 
http://ntextcat.codeplex.com/ or at least change textcat to use UTF8.

2. Word stemming. This can be done for many languages with Snowball library. 
Solr has also implemented several other languages, perhaps its code can be 
somehow automatically translated to C(++) for use with Dovecot?

3. Don't index language-specific stopwords. We can get the word lists from e.g. 
Solr.

4. Try to detect compound words and index each part separately for languages 
that use them. http://wiki.apache.org/solr/LanguageAnalysis#Decompounding 
suggests two possible ways to do it.

5. Normalize words (e.g. drop diacritics). libicu can be used for this.

6. Drop (Unicode) characters that don't belong to the language? Or especially 
don't index most of the weird Unicode characters. This would avoid filling the 
index with unnecessary garbage.

7. Don't index non-text data? For example if there is large block of base64 
data or something else that definitely doesn't look like text, it's pretty 
useless to index it. Then again, we do want to index all kinds of IDs that 
someone might want to search. This could be a bit difficult to implement well.

8. Index attachments separately, so it would be possible to search only 
attachments. (Should "SEARCH BODY word1 BODY word2" return matches if word1 and 
word2 are in different attachments?)

9. Attachments can be translated to indexable UTF-8 text already with 
fts_decoder setting by doing it via a conversion script. This could also 
support Apache Tika server directly.

10. It should be configurable which fields are indexed. Body and header would 
always be separately indexed. Optionally there could be also at least: 
attachments, From, To, Cc, Bcc and Subject. The From/To/Cc/Bcc could also be 
indexed together in one "addresses" field. The more fields there are, the 
larger the index, but better/faster search results.

11. Each indexed mail should have metadata: Mailbox GUID, mail UID and the 
language the mail was indexed with. For attachments there should also be the 
MIME part number. When matching results, drop results if returned language 
doesn't match the query language.

Squat
-----

Currently Squat index consists of a trie containing all the words and pointer 
to a file listing all the message UIDs that contain them. Each node in the trie 
has a pointer to the UIDs, so e.g. with "abc" the "a" node will contain UIDs of 
all mails that contain the "a" letter (e.g. 1,3-5,10). "ab" node will contain 
mails that have the "ab" substring. Since the "ab" is a subset of "a", the "ab" 
won't contain UIDs directly but instead it contains indexes to the "a" list to 
get a better compression (e.g. UID 3-5,10 -> 2-4 indexes in the "1,3-5,10" 
list). The "abc" node then similarly refers to the "ab" node's indexes.

It's configurable how long words Squat will index. Also substring matching is 
configurable. By default both are 4 letters, so words longer than 4 letters 
will be split to 4 letter pieces which are indexed (e.g. "dovecot" -> "dove", 
"ovec", "veco", "ecot"). When searching these pieces are looked up and the 
results are merged.

It's pretty pointless to do a search for 1-2 letter substrings. Most likely the 
user wants to find 1-2 letter word instead. Perhaps this is true also for 3 
letters? The Squat index could be changed to only add results for the first 1-2 
(or 1-3?) letters only for full words, not to word prefixes. This of course 
would mean that the "ab" referring to "a" UID list would no longer work for the 
first nodes.

Substring searching likely wouldn't work very nicely for stemmed words. So 
Squat should probably index the full stemmed word and then also index the 
unstemmed word in the small 4 letter pieces. It should be possible to also 
disable substring searching entirely.

Squat already attempts to reduce disk space by encoding the common characters 
with less bits than other characters. This is hardcoded for English language 
though. Each index compression could analyze the most common letters and 
dynamically use those. Perhaps different languages could even use separate 
Squat index files to get the most advantage of this? Although if there are a 
lot of languages it's a bit annoying to have many different index files.

Squat currently indexes each Unicode character as one letter, so there can be 
quite a huge number of different 4 letter words. This substring indexing should 
probably be disabled for words that contain letters not used by the current 
language.

The UID lists are not compressed very efficiently. See 
http://www2009.eprints.org/41/1/p401.pdf for alternatives.

There is currently one Squat index per mailbox. There should be one per user. 
This requires adding a separate ID -> { mailbox GUID, UID, language, mime part 
} mapping file.

Squat indexes should be updated in two parts. For old mails there are the large 
trie+uidlist base files in optimized format. For new mails there are small 
trie+uidlist files in a format where they are just appended to. When the trie 
or uidlist becomes large enough, it's first internally sorted/optimized, and 
then merged with the base files by recreating new optimized base files. This 
fixes the scalability problem with current large Squat indexes where with large 
unsorted indexes the sorting could have taken forever.

Since substring searches in Squat produce only "maybe matches" UID lists, the 
mails still need to be opened and searched through. This step could probably 
also do the same language detection + stemming as is done during indexing to 
improve the search results. (And this feature could be enabled even if no FTS 
index is enabled.)

Reply via email to