[ https://issues.apache.org/jira/browse/LUCENE-6954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213942#comment-15213942 ]
ASF subversion and git services commented on LUCENE-6954:
---------------------------------------------------------

Commit e8dac9bfdf358fff3b484ed5cd9032c1fe4bae96 in lucene-solr's branch refs/heads/master from [~teofili]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e8dac9b ]

LUCENE-6954 - keep info about relationship between fields and terms when retrieving terms in MLT

> More Like This Query: keep fields separated
> -------------------------------------------
>
>                 Key: LUCENE-6954
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6954
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/other
>   Affects Versions: 5.4
>            Reporter: Alessandro Benedetti
>            Assignee: Tommaso Teofili
>              Labels: morelikethis
>         Attachments: LUCENE-6954.patch
>
>
> Currently the query is generated as follows:
> org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
> 1) we extract the terms from the interesting fields, adding them to a map:
> Map<String, Int> termFreqMap = new HashMap<>();
> (we lose the field -> term relation; we no longer know which field the term
> came from)
> org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
> 2) we build the queue that will contain the query terms; at this point we
> connect these terms to some field again, but:
> ...
> // go through all the fields and find the largest document frequency
> String topField = fieldNames[0];
> int docFreq = 0;
> for (String fieldName : fieldNames) {
>   int freq = ir.docFreq(new Term(fieldName, word));
>   topField = (freq > docFreq) ? fieldName : topField;
>   docFreq = (freq > docFreq) ? freq : docFreq;
> }
> ...
> We identify the topField as the field with the highest document frequency for
> the term t.
> Then we build the termQuery:
> queue.add(new ScoreTerm(word, topField, score, idf, docFreq, tf));
> In this way we lose a lot of precision.
> Not sure why we do that.
> I would prefer to keep the relation between terms and fields.
> This could improve the quality of the MLT query considerably.
> If I run the MLT on 2 fields, for example weSell and weDontSell,
> it is likely that I want to find documents with similar terms in weSell and
> similar terms in weDontSell, without mixing things up and losing the
> semantics of the terms.
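
The difference between the two bookkeeping strategies described in the issue can be sketched without any Lucene dependency. The following is only an illustration, not the code from LUCENE-6954.patch or the commit: the class name `PerFieldTermFrequencies` and the methods `flat` and `perField` are hypothetical, and the input is assumed to be already tokenized per field rather than read from term vectors or an analyzer.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; this is not Lucene's MoreLikeThis code. */
final class PerFieldTermFrequencies {

  /** Single map, term -> frequency: the field a term came from is discarded. */
  static Map<String, Integer> flat(Map<String, List<String>> fieldTokens) {
    Map<String, Integer> freq = new LinkedHashMap<>();
    for (List<String> tokens : fieldTokens.values()) {
      for (String term : tokens) {
        freq.merge(term, 1, Integer::sum);
      }
    }
    return freq;
  }

  /** Nested map, field -> (term -> frequency): the field/term relation is kept. */
  static Map<String, Map<String, Integer>> perField(Map<String, List<String>> fieldTokens) {
    Map<String, Map<String, Integer>> out = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> entry : fieldTokens.entrySet()) {
      Map<String, Integer> freq =
          out.computeIfAbsent(entry.getKey(), k -> new LinkedHashMap<>());
      for (String term : entry.getValue()) {
        freq.merge(term, 1, Integer::sum);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Assumed sample document with the two fields named in the issue.
    Map<String, List<String>> doc = new LinkedHashMap<>();
    doc.put("weSell", List.of("laptop", "phone", "laptop"));
    doc.put("weDontSell", List.of("phone", "tablet"));

    System.out.println(flat(doc));     // {laptop=2, phone=2, tablet=1}
    System.out.println(perField(doc)); // {weSell={laptop=2, phone=1}, weDontSell={phone=1, tablet=1}}
  }
}
```

In the flat map, "phone" has frequency 2 and nothing records that one occurrence came from weSell and the other from weDontSell. With the nested map, a query term can be built for each (field, term) pair instead of assigning the term to a single top field.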