I’m using Dovecot FTS with the flatcurve backend in a mailcow: dockerized setup.
When searching for an email address with a hyphenated local-part (e.g., 
[email protected]), the email-address tokenizer splits the local-part on 
hyphens, producing tokens like ma, g, and [email protected]. This prevents 
searching for ma-g as a single term. With fts_flatcurve_substring_search = yes, 
searching for ma-g matches unrelated addresses containing ma (e.g., 
[email protected]), leading to irrelevant results.
Dovecot version: 2.3.21.1 (d492236fa0)
Only the relevant part of the Dovecot configuration is included:
plugin {
    fts = flatcurve
    fts_autoindex = yes
    fts_autoindex_exclude = \Junk
    fts_autoindex_exclude2 = \Trash
    fts_autoindex_max_recent_msgs = 999999
    fts_tokenizers = generic email-address
    fts_tokenizer_email_address = maxlen=100
    fts_tokenizer_generic = algorithm=simple maxlen=100
    fts_flatcurve_substring_search = yes
    fts_languages = en es de ru
    fts_filters = normalizer-icu snowball stopwords
    fts_filters_en = lowercase snowball english-possessive stopwords
    fts_filters_ru = lowercase snowball stopwords
    fts_index_timeout = 300s
}
service indexer-worker {
    process_limit = 12
    vsz_limit = 512 MB
}



Steps to Reproduce:
Index an email with [email protected] in the From field, and also index emails 
whose From addresses contain "ma" or "g".

Check tokenization:

doveadm fts tokenize -u [email protected] "[email protected]"

Output:

ma
g
example
com
[email protected]

Search:

doveadm search -u [email protected] FROM ma-g

Results include [email protected] due to ma matching.

Expected Behavior:
FROM ma-g should match only emails with [email protected], treating ma-g as a 
single term or exact local-part.
Expected tokens:
doveadm fts tokenize -u [email protected] "[email protected]"

Output:
ma-g
ma
g
example
com
[email protected]
Actual Behavior:
The tokenizer splits ma-g into ma and g. Substring search matches "ma" or "g" 
in unrelated addresses (e.g., [email protected], [email protected]). Without 
substring search, ma-g matches nothing unless searching the full address.
Impact:
Searching for short hyphenated local-parts is unreliable, especially when they 
contain common fragments like ma, which flood the results with irrelevant 
matches.
Request:
Add a configuration option, such as 
"fts_tokenizer_email_address_keep_hyphenated = yes|no" (default: no, for 
compatibility), to include the hyphenated local-part of an email address as an 
additional token. For example, with "yes", tokenizing "[email protected]" would 
produce "ma-g", "ma", "g", "example", "com", and "[email protected]". This 
allows searches for "FROM ma-g" to match emails with "[email protected]" 
exactly, while preserving "ma" and "g" for substring searches. Consider making 
"yes" the default in a future release: RFC 5322 permits hyphens within the 
local-part, and users expect precise matches when searching for hyphenated 
addresses like "[email protected]". If the default changes, please provide 
upgrade notes for users relying on the current token set.
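
To make the requested token set concrete, here is a rough Python model of the 
current and proposed behavior. It is only an illustration and not Dovecot's 
tokenizer code: the function name tokenize_address, the keep_hyphenated flag, 
and the addresses ma-g@example.com and mark@example.com are all made up for 
this sketch.

```python
import re


def tokenize_address(address, keep_hyphenated=False):
    """Model the token list described in this report for one address.

    Illustrative only: this is not how Dovecot implements tokenization.
    """
    local_part, _, domain = address.partition("@")
    tokens = []
    # Proposed extra token: the hyphenated local-part kept whole.
    if keep_hyphenated and "-" in local_part:
        tokens.append(local_part)
    # Behavior as observed: local-part and domain are split on
    # non-alphanumeric characters.
    tokens.extend(t for t in re.split(r"[^0-9A-Za-z]+", local_part) if t)
    tokens.extend(t for t in re.split(r"[^0-9A-Za-z]+", domain) if t)
    # The full address is always kept as its own token.
    tokens.append(address)
    return tokens


# Hypothetical addresses, chosen only for this example.
current = tokenize_address("ma-g@example.com")
proposed = tokenize_address("ma-g@example.com", keep_hyphenated=True)
unrelated = tokenize_address("mark@example.com", keep_hyphenated=True)

print(current)
print(proposed)
print("ma-g" in current, "ma-g" in proposed, "ma-g" in unrelated)
```

With the extra token, an exact term match on ma-g would hit only the 
hyphenated address, while the existing ma and g tokens stay available for 
substring search.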

Is there any workaround to search hyphenated local-parts accurately?

Best regards,
Daniel Levin



