Re: Inaccurate results while searching for a phrase in subject (fts-flatcurve)

2023-05-26 Thread ss17
Thanks, Michael, for that explanation. So with the addition of tokenization, has 
Dovecot lost the ability to search for phrases, irrespective of the FTS engine? 
That would be a real bummer if true.
___
dovecot mailing list -- dovecot@dovecot.org
To unsubscribe send an email to dovecot-le...@dovecot.org


Re: Inaccurate results while searching for a phrase in subject (fts-flatcurve)

2023-05-25 Thread Michael Slusarz via dovecot
See below.

> On 05/23/2023 2:14 AM MDT s...@fea.st wrote:
> 
> I have been using the lucene FTS plugin for a decade now and it has done me 
> well. Thought of upgrading to the new & current stuff and came across the 
> flatcurve plugin which seems very promising (xapian on the other hand was 
> creating indexes larger than my mailboxes themselves). I am using the 
> following configuration in dovecot.conf:
> 
> fts = flatcurve
> fts_filters_en = lowercase english-possessive stopwords
> fts_languages = en
> fts_tokenizers = generic email-address

^^^ FTS input is being tokenized, so the phrase "/home/johndoe/render.php" will 
be indexed not as a full string but instead separately as "home", "johndoe", 
"render", and "php".

See: 
https://doc.dovecot.org/settings/plugin/fts-plugin/#plugin_setting-fts-fts_tokenizers

This has nothing to do with flatcurve (or any FTS driver) - Dovecot will never 
send the full "/home/johndoe/render.php" to the driver to be indexed.
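To make the effect concrete, here is a minimal Python sketch of that kind of tokenization. It is only an illustration: `simple_tokenize` and `lowercase_filter` are hypothetical stand-ins for the `generic` tokenizer and the `lowercase` filter, not Dovecot's actual code, and the real tokenizer has more rules than "split on anything that is not a letter or digit". The stopwords and english-possessive filters are not modeled.

```python
import re


def simple_tokenize(text):
    """Hypothetical stand-in for a generic tokenizer:
    split on every run of characters that are not letters or digits."""
    return [tok for tok in re.split(r"[^0-9A-Za-z]+", text) if tok]


def lowercase_filter(tokens):
    """Hypothetical stand-in for the 'lowercase' filter."""
    return [tok.lower() for tok in tokens]


subject = "CRON: /home/johndoe/render.php OK"
indexed_terms = lowercase_filter(simple_tokenize(subject))

# The path never appears as one term; only its pieces do.
print(indexed_terms)
# → ['cron', 'home', 'johndoe', 'render', 'php', 'ok']
```

With terms like these in the index, a lookup for the whole path as a single term cannot match anything.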


> fts_autoindex = no
> fts_enforced = yes
> 
> A search command like this:
> 
> doveadm -D search -u j...@doe.com mailbox INBOX SUBJECT 
> "/home/johndoe/render.php"
> 
> should show the messages with subject: "CRON: /home/johndoe/render.php OK" 
> but produces a lot of extra undesired results and I think the second line in 
> this debug output indicates the reason:
> 
> May 23 07:44:13 doveadm(j...@doe.com): Debug: fts-flatcurve(INBOX): Query 
> (hdr_subject:/home/johndoe/render.php*) matches=0 uids=

This is correct: "/home/johndoe/render.php" was never indexed as a single term, 
so there should be zero results.


> May 23 07:44:13 doveadm(j...@doe.com): Debug: fts-flatcurve(INBOX): Query 
> (hdr_subject:php* AND hdr_subject:render* AND hdr_subject:johndoe* AND 
> hdr_subject:home*) matches=272 

And this is also correct, as the search is attempted with both the full string 
and all of its tokenized components.  (Both the original text and all search 
terms are processed through the tokenizer before being passed to an FTS 
driver.)
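The extra results follow from how an AND of independent prefix terms behaves: word order and adjacency are not part of the query. A rough Python sketch, assuming the same simplified tokenizer idea as the query in the debug output (the helper names and the second and third subjects are made up for illustration; this is not how flatcurve or Xapian is implemented):

```python
import re


def simple_tokenize(text):
    """Hypothetical simplified tokenizer: lowercase alphanumeric runs."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]


def matches_all_prefixes(query, subject):
    """True when every query token is a prefix of some subject token,
    mirroring 'hdr_subject:php* AND hdr_subject:render* AND ...'."""
    subject_tokens = simple_tokenize(subject)
    return all(
        any(tok.startswith(q) for tok in subject_tokens)
        for q in simple_tokenize(query)
    )


query = "/home/johndoe/render.php"
subjects = [
    "CRON: /home/johndoe/render.php OK",       # the message actually wanted
    "php render job moved to johndoe's home",  # made-up false positive
    "CRON: /home/janedoe/backup.sh OK",        # lacks johndoe, render, php
]

hits = [s for s in subjects if matches_all_prefixes(query, s)]
print(hits)
```

The first two subjects match because both contain all four tokens, even though only the first contains the path itself.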


> I tried rebuilding the indexes with "fts_flatcurve_substring_search = yes" 
> too but that didn't change anything. It works as expected with the lucene 
> plugin because in that case header search is performed via dovecot indexes 
> instead of FTS. Maybe I am not doing something right in configuring this new FTS? 

I'm not a lucene expert... but with the old lucene plugin, you were almost 
certainly using it without Dovecot tokenization support, since the plugin 
predates it (I think) - using Dovecot tokenization would have required 
'use_libfts' to be present in the fts_lucene setting (which I doubt was ever 
documented).  I believe Dovecot was just doing simple white-space tokenization 
instead, so lucene code/library was likely receiving the full string and doing 
internal tokenization.
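For comparison, a small Python sketch of plain white-space tokenization, which is what I believe was happening in the old setup. This only illustrates the idea; it is not the lucene plugin's code, and `whitespace_tokenize` is a hypothetical helper.

```python
def whitespace_tokenize(text):
    """Split on white space only; punctuation such as '/' and '.' is kept."""
    return text.split()


subject = "CRON: /home/johndoe/render.php OK"
tokens = whitespace_tokenize(subject)
print(tokens)
# → ['CRON:', '/home/johndoe/render.php', 'OK']

# The path survives as one token, so the backend receives the full string.
path_survives = "/home/johndoe/render.php" in tokens
```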

michael


Inaccurate results while searching for a phrase in subject (fts-flatcurve)

2023-05-23 Thread ss17
Hi,

I have been using the lucene FTS plugin for a decade now and it has done me 
well. Thought of upgrading to the new & current stuff and came across the 
flatcurve plugin which seems very promising (xapian on the other hand was 
creating indexes larger than my mailboxes themselves). I am using the 
following configuration in dovecot.conf:

fts = flatcurve
fts_filters_en = lowercase english-possessive stopwords
fts_languages = en
fts_tokenizers = generic email-address
fts_autoindex = no
fts_enforced = yes

A search command like this:

doveadm -D search -u j...@doe.com mailbox INBOX SUBJECT 
"/home/johndoe/render.php"

should show the messages with subject: "CRON: /home/johndoe/render.php OK" but 
produces a lot of extra undesired results and I think the second line in this 
debug output indicates the reason:

May 23 07:44:13 doveadm(j...@doe.com): Debug: fts-flatcurve(INBOX): Query 
(hdr_subject:/home/johndoe/render.php*) matches=0 uids=
May 23 07:44:13 doveadm(j...@doe.com): Debug: fts-flatcurve(INBOX): Query 
(hdr_subject:php* AND hdr_subject:render* AND hdr_subject:johndoe* AND 
hdr_subject:home*) matches=272 
uids=67041,67085,67188,67223,67257,67290,67323,67355,67395,67564,67770,67817,67863,67985,68819,69512,69572,69635,69737,70017,70058,70086,70125,70147,70191,70296,70304,70331,70340,70350,70354,70375,70407,70417,70427,70449,70499,70521:70522,70535:70550,70555,70561:70563,70591,70597:70599,70662,70685,70702,70708,70718:70719,70724,70727:70728,70730:70733,70735,70746:70747,70754,70775,70777,70794,70811:70812,70822,70866,70942,70948,70971,71017,71021,71040,71042,71075,71079,71084,71113,71128:71129,71131,71152,71160,71184,71188,71208,71214,71225,71255,71269,71297,71300,71331,71375,71422,71449,71457,71467,71469,71495,71515,71605,71626,71632,71649,71672,71681:71682,71689,71692,71699,71716,71757,71770,71777,71782:71785,71790,71795,71797,71814,71818:71819,71828,71838:71842,71845,71859:71860,71937,71947,71954,71960,71963:71964,71977,71990,72014,72021:72022,72030,72034:72042,72045:72046,72049,72056,72061,72063,72073:72074,72083,72088,72090,72092,72101,72108,72129,72131:72132,72134,72136:72140,72159,72163,72172:72173,72186,72212,72218:72223,72237,72239,72246,72267,72288,72387,72410,72446,72469,72476:72477,72514,72541,72543,72568:72569,72572:72574,72598,72604,72606,72609,72644,72674,72687,72691,72694,72734,72772,72791,72797,72799,72803,72832:72833,72835:72841,72856:72857,72866:72867,72873:72874,72901,72930,72938,72948,72960,72965,72976,73018,73037,73071,73081,73116,73158,73249,73307,73352,73392,73466,73533,73601,73670,73733,73775,73784:73786,73804,73807,73811,73815,73819,73823,73825,73831,73842,73846,74005,74199,74390,74540,74684,74854,75017,75192,75354,75525,75710,75839:75843,75845,75903,75984:75985,76091,76263,76447,76624,76816,76989,77091:77092,77097,77119,77155,77293,77460,77608,77761,77908,78066,78218,78393,78400:78401,78522:78523,78560,78728,78921,79104,79298,79504,79555,79898,80027,80031:80032,80034:80035,80037,80056,80071,80073,80077:80079,80082:80084,80086,80089

I tried rebuilding the indexes with "fts_flatcurve_substring_search = yes" too 
but that didn't change anything. It works as expected with the lucene plugin 
because in that case header search is performed via dovecot indexes instead of 
FTS. Maybe I am not doing something right in configuring this new FTS? I would 
really appreciate some pointers here.

Thanks,
Sam