I’m using Dovecot FTS with the flatcurve backend in a mailcow: dockerized setup. When I search for an email address with a hyphenated local-part (e.g., [email protected]), the email-address tokenizer splits the local-part on hyphens, producing tokens like ma, g, and [email protected]. As a result, ma-g cannot be searched as a single term. With fts_flatcurve_substring_search = yes, a search for ma-g also matches unrelated addresses containing ma (e.g., [email protected]), which returns irrelevant results.

Dovecot version: 2.3.21.1 (d492236fa0)

Relevant part of the Dovecot config:

    plugin {
      fts = flatcurve
      fts_autoindex = yes
      fts_autoindex_exclude = \Junk
      fts_autoindex_exclude2 = \Trash
      fts_autoindex_max_recent_msgs = 999999
      fts_tokenizers = generic email-address
      fts_tokenizer_email_address = maxlen=100
      fts_tokenizer_generic = algorithm=simple maxlen=100
      fts_flatcurve_substring_search = yes
      fts_languages = en es de ru
      fts_filters = normalizer-icu snowball stopwords
      fts_filters_en = lowercase snowball english-possessive stopwords
      fts_filters_ru = lowercase snowball stopwords
      fts_index_timeout = 300s
    }

    service indexer-worker {
      process_limit = 12
      vsz_limit = 512 MB
    }
Steps to Reproduce:

1. Index an email with [email protected] in the From field, and also index emails whose From addresses contain "ma" and "g".

2. Check tokenization:

       doveadm fts tokenize -u [email protected] "[email protected]"

   Output:

       ma g example com [email protected]

3. Search:

       doveadm search -u [email protected] FROM ma-g

   Results include [email protected] due to ma matching.

Expected Behavior:

FROM ma-g should match only emails with [email protected], treating ma-g as a single term or as the exact local-part. Expected tokens:

    doveadm fts tokenize -u [email protected] "[email protected]"

Output:

    ma-g ma g example com [email protected]

Actual Behavior:

The tokenizer splits ma-g into ma and g. Substring search then matches "ma" or "g" in unrelated addresses (e.g., [email protected], [email protected]). Without substring search, ma-g matches nothing unless I search for the full address.

Impact:

Searching for short hyphenated local-parts is unreliable. Common fragments like ma flood the results with irrelevant matches.

Request:

Add a configuration option, such as "fts_tokenizer_email_address_keep_hyphenated = yes|no" (default: no, for compatibility), that includes the hyphenated local-part of an email address as an additional token. For example, with "yes", tokenizing "[email protected]" would produce "ma-g", "ma", "g", "example", "com", and "[email protected]". A search for "FROM ma-g" could then match emails with "[email protected]" exactly, while "ma" and "g" would remain available for substring searches.

Consider making "yes" the default in the future: hyphens are valid in the local-part under RFC 5322, and users expect precise results when searching for email addresses, especially common hyphenated ones like "[email protected]". If the default changes, please provide upgrade notes for users relying on the current token set.

Is there any workaround to search hyphenated local-parts accurately?
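To make the request concrete, here is a minimal Python sketch of the two token sets described above. It is not Dovecot code and does not claim to mirror how the email-address tokenizer is implemented. The function name tokenize_address, the keep_hyphenated parameter, and the address ma-g@example.com are illustrative assumptions, chosen only to be consistent with the tokens listed in this report.

```python
import re


def tokenize_address(address: str, keep_hyphenated: bool = False) -> list[str]:
    """Approximate the token sets described in this report.

    Hypothetical helper for illustration only; it is not Dovecot's tokenizer.
    """
    local, _, domain = address.partition("@")
    tokens: list[str] = []

    # Proposed behavior: also emit the intact hyphenated local-part.
    if keep_hyphenated and "-" in local:
        tokens.append(local)

    # Observed behavior: local-part and domain are split on non-alphanumerics.
    tokens += [t for t in re.split(r"[^0-9A-Za-z]+", local) if t]
    tokens += [t for t in re.split(r"[^0-9A-Za-z]+", domain) if t]

    # The full address is kept as a single token in both cases.
    tokens.append(address)
    return tokens


# Placeholder address, assumed for this example.
current = tokenize_address("ma-g@example.com")
proposed = tokenize_address("ma-g@example.com", keep_hyphenated=True)

print(current)   # ['ma', 'g', 'example', 'com', 'ma-g@example.com']
print(proposed)  # ['ma-g', 'ma', 'g', 'example', 'com', 'ma-g@example.com']

# An exact-term lookup for "ma-g" only succeeds with the proposed token set.
print("ma-g" in current)   # False
print("ma-g" in proposed)  # True
```

The extra token is additive, so the existing "ma" and "g" tokens stay in the index and current searches would keep working as they do today.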
Best regards,
Daniel Levin
