Thanks again for the answer, Ivan.

Would it be simpler to change the way tf is calculated directly in the source code? I mean replacing something like tf = sqrt(n) with tf = min(10, sqrt(n)) wherever it is computed.

Cheers,
Patrick
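To make the proposed change concrete, here is a minimal sketch in plain Python of the capped formula next to the square-root tf that this thread attributes to the default similarity. It only illustrates the arithmetic and is not Lucene or Elasticsearch source code; the names `sqrt_tf`, `capped_tf` and the `cap` parameter are made up for this example.

```python
import math


def sqrt_tf(freq: float) -> float:
    """Square-root term frequency, as described in the thread."""
    return math.sqrt(freq) if freq > 0 else 0.0


def capped_tf(freq: float, cap: float = 10.0) -> float:
    """Square-root term frequency clamped to at most `cap` (hypothetical helper)."""
    return min(cap, sqrt_tf(freq))


if __name__ == "__main__":
    # Compare the two formulas for a few raw term counts.
    for n in (1, 4, 100, 400, 10_000):
        print(f"n={n:>6}  sqrt_tf={sqrt_tf(n):8.2f}  capped_tf={capped_tf(n):6.2f}")
```

Note that with a cap of 10 the clamp only takes effect once a term occurs more than 100 times in a document, since sqrt(100) = 10; a lower cap would be needed to flatten smaller counts.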
On Friday, March 21, 2014 at 18:01:51 UTC-4, Ivan Brusic wrote:
>
> Term frequencies are stored within Lucene, so there is no calculation of the value, just a lookup in the data structure. You can disable term frequencies and then create your own in the script, but it would be easier to calculate that value at index time so that you can access it within your custom score and not have to iterate through all the terms yourself. Britta has posted on the mailing list in the past, so hopefully she will reply with some more authoritative answers, especially ones regarding performance.
>
> --
> Ivan
>
> On Fri, Mar 21, 2014 at 11:54 AM, geantbrun <agin.p...@gmail.com> wrote:
>
>> Thanks a lot Ivan, great answer.
>>
>> Suppose I use my own formula for tf in my script (with _index[field][term].tf()) and set the boost_mode to "replace". Does elasticsearch calculate the tf twice or only once? In other words, is it computationally efficient to calculate my own tf? Should I turn off other calculations made by es somewhere else to avoid double calculations?
>>
>> Cheers,
>> Patrick
>>
>> On Thursday, March 20, 2014 at 17:44:53 UTC-4, Ivan Brusic wrote:
>>>
>>> You can provide your own similarity to be used at the field level, but recent versions of elasticsearch allow you to access the tf-idf values in order to do custom scoring [1]. Also look at Britta's recent talk on the subject [2].
>>>
>>> That said, either your custom similarity or custom scoring would need to know exactly which terms are repeated many times. Have you looked into omitting term frequencies? It would completely bypass using term frequencies, which might be overkill in your case. Look into the index options [3].
>>>
>>> Finally, perhaps the common terms query can help [4].
>>>
>>> [1] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-advanced-scripting.html
>>>
>>> [2] https://speakerdeck.com/elasticsearch/scoring-for-human-beings
>>>
>>> [3] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html#string
>>>
>>> [4] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html
>>>
>>> Cheers,
>>>
>>> Ivan
>>>
>>> On Thu, Mar 20, 2014 at 8:08 AM, geantbrun <agin.p...@gmail.com> wrote:
>>>
>>>> Hi,
>>>> If I understand correctly, the formula used for the term frequency part in the default similarity module is the square root of the actual frequency. Is it possible to modify that formula to include something like a min(my_max_value,sqrt(frequency))? I would like to avoid huge tf's for documents that have the same term repeated many times. It seems that BM25 similarity has a parameter to control saturation but I would prefer to stick with the simple tf/idf similarity module.
>>>> Thank you for your help
>>>> Patrick
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google Groups "elasticsearch" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearc...@googlegroups.com.
>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/9a12b611-d08d-41f9-8fd4-b74ad75a6a5c%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.