[ https://issues.apache.org/jira/browse/SOLR-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753928#comment-16753928 ]
Dawid Weiss commented on SOLR-12812:
------------------------------------

The only difference is that with Solr you know the field-value mapping, and here you need to provide it within the request (because the values are external to the index). I know it can be done in Lucene, but the corresponding code needs to be added to the MLT handler to extract the field-value mapping first.

> Add support for arbitrary field:text pairs to streaming similarity calculation in MoreLikeThisHandler
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-12812
>                 URL: https://issues.apache.org/jira/browse/SOLR-12812
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Dawid Weiss
>            Priority: Minor
>
> In this issue I would like to add support for the streaming MLT case where the content of the request explicitly specifies the field:text pairs to be used for MLT lookup.
>
> A longer explanation of why the current solutions do not work, based on a real use case.
>
> Let's say a Solr instance has multiple cores (collections of documents). We'd like to search for similar documents between these cores. Let's assume each collection of documents has three fields: title, summary and abstract.
>
> At the moment Solr has two MLT handler options: the query-based one (similarity to an indexed document) and the free-text-based one (similarity to an arbitrary text).
>
> 1) The first MLT pipeline in Solr looks for documents similar to the given one (I'll assume a single document as input, to keep things simple). This pipeline reads the content of the document from the existing index and creates a mapping between fields and actual values stored in that document.
> Let's say the document looks like this:
>
> title: foo bar
> summary: baz bar
> abstract: ping ping
>
> The "interesting term" extraction routine in MoreLikeThisHelper will extract those terms and score them against each field's statistics, then take the top-N best scoring terms (and the fields they're assigned to) and create a Boolean query from them. It could go something like this:
>
> title:foo^1.5 summary:bar^0.5
>
> When this query is applied against the collection it would *not* match "bar" in the title or abstract (because the weighted "important" term wasn't selected in that field). That's the way it should be.
>
> 2) In the second pipeline, we give the full "text" for which we wish to obtain similar documents. If we were to emulate scenario (1), we'd have to cram the content of each field into a single blob of text, so it'd become something like:
>
> foo bar, baz bar, ping ping
>
> Solr takes this text and creates a pseudo-document that maps the provided set of fields (mlt.fl) to this value. So effectively it creates a pseudo-document like this:
>
> title: foo bar, baz bar, ping ping
> summary: foo bar, baz bar, ping ping
> abstract: foo bar, baz bar, ping ping
>
> What follows is identical to scenario (1), but note that this time the set of terms for each field (and their scores) is much broader. This means that the final query can look like this:
>
> title:foo^1.5 summary:foo^0.5 title:bar^1 summary:bar^0.5
>
> This results in severely skewed MLT results (for example, shorter fields will have drastically different term statistics).
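
The difference between the two field-value mappings described in scenarios (1) and (2) can be illustrated with a small sketch. This is plain Python, not Solr or Lucene code: the function names (`mapping_from_document`, `mapping_from_free_text`, `candidate_terms`) are hypothetical, and the real term selection also scores terms against per-field statistics, which is left out here. It only shows which terms become candidates for each field.

```python
from typing import Dict, List, Set

# Field names taken from the example in the issue description.
FIELDS: List[str] = ["title", "summary", "abstract"]


def mapping_from_document(doc: Dict[str, str]) -> Dict[str, str]:
    """Scenario (1): every field keeps its own stored value."""
    return {field: doc[field] for field in FIELDS if field in doc}


def mapping_from_free_text(text: str, fields: List[str]) -> Dict[str, str]:
    """Scenario (2): the same text blob is assigned to every field in mlt.fl."""
    return {field: text for field in fields}


def candidate_terms(mapping: Dict[str, str]) -> Dict[str, Set[str]]:
    """Toy tokenizer (an assumption): split on whitespace, ignore commas."""
    return {
        field: set(value.replace(",", " ").split())
        for field, value in mapping.items()
    }


doc = {"title": "foo bar", "summary": "baz bar", "abstract": "ping ping"}

# Scenario (1): candidates stay confined to the field they came from.
per_field = candidate_terms(mapping_from_document(doc))

# Scenario (2): the fields are crammed into one blob and broadcast to all fields.
blob = ", ".join(doc[field] for field in FIELDS)
broadcast = candidate_terms(mapping_from_free_text(blob, FIELDS))

for field in FIELDS:
    print(field, sorted(per_field[field]), "vs", sorted(broadcast[field]))
```

In the first mapping "foo" can only ever be selected for the title field, while in the second every term is a candidate for every field. A request that carries explicit field:text pairs, as proposed in this issue, would let the streaming case build the first kind of mapping for content that is not in the index.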