[ 
https://issues.apache.org/jira/browse/SOLR-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753928#comment-16753928
 ] 

Dawid Weiss commented on SOLR-12812:
------------------------------------

The only difference is that with Solr you know the field-value mapping, whereas 
here you need to provide it within the request (because the values are external 
to the index). I know it can be done in Lucene, but the corresponding code needs 
to be added to the MLT handler to extract the field-value mapping first. 
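
As a rough illustration of that extraction step (not the actual handler code), 
here is a minimal Python sketch. It assumes the request body is a JSON object of 
field:text pairs; that body format and the extract_field_values helper are 
hypothetical and are not part of Solr's API.

```python
import json


def extract_field_values(body, mlt_fields):
    """Parse a hypothetical JSON request body into a field -> text mapping.

    Only fields listed in mlt_fields (standing in for mlt.fl) are kept;
    any other key in the body is ignored.
    """
    data = json.loads(body)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object of field:text pairs")
    return {field: str(text) for field, text in data.items() if field in mlt_fields}


# Hypothetical request body: the values are external to the index,
# so the caller has to spell out which text belongs to which field.
body = '{"title": "foo bar", "summary": "baz bar", "abstract": "ping ping", "id": "42"}'
mapping = extract_field_values(body, ["title", "summary", "abstract"])
print(mapping)
```

The resulting mapping plays the role of the field-value mapping that Solr would 
otherwise read from an indexed document.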

> Add support for arbitrary field:text pairs to streaming similarity 
> calculation in MoreLikeThisHandler
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-12812
>                 URL: https://issues.apache.org/jira/browse/SOLR-12812
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Dawid Weiss
>            Priority: Minor
>
> In this issue I would like to add support for streaming MLT case where the 
> content of the request specifies explicitly the field:text pairs to be used 
> for MLT lookup.
> Below is a longer explanation, based on a real use case, of why the current 
> solutions do not work. 
> Let's say a Solr instance has multiple cores (collections of documents). We'd 
> like to search for similar documents between these cores. Let's assume each 
> collection of documents has three fields: title, summary and abstract.
> At the moment Solr has two MLT handler options: the query-based (similarity 
> to an indexed document) and the free-text based (similarity to an arbitrary 
> text).
> 1) The first MLT pipeline in Solr looks for documents similar to the given 
> one (I'll assume a single document as input, to keep things simple). This
> pipeline reads the content of the document from the existing index and creates
> a mapping between fields and actual values stored in that document.
> Let's say the document looks like this:
> title: foo bar
> summary: baz bar
> abstract: ping ping
> The "interesting term" extraction routine in MoreLikeThisHelper will extract 
> those terms and score them against each field's statistics, then take the 
> top-N best-scoring terms (and the fields they're assigned to) and create a 
> Boolean query from them. It could go something like this:
> title:foo^1.5 summary:bar^0.5
> When this query is applied against the collection it would *not* match "bar" 
> in the title or abstract (because the weighted "important" term wasn't 
> selected in that field). That's the way it should be.
> 2) In the second pipeline, we give the full "text" for which we wish to 
> obtain similar documents. If we were to emulate scenario (1), we'd have to 
> cram the content of each field into a single blob of text, so it'd become 
> something like:
> foo bar, baz bar, ping ping
> Solr takes this text and creates a pseudo-document that maps the
> provided set of fields (mlt.fl) to this value. So effectively it
> creates a pseudo-document like this:
> title: foo bar, baz bar, ping ping
> summary: foo bar, baz bar, ping ping
> abstract: foo bar, baz bar, ping ping
> What follows is identical to scenario (1), but note that this time the
> set of terms for each field (and their scores) is much broader. This
> means that the final query can look like this:
> title:foo^1.5 summary:foo^0.5 title:bar^1 summary:bar^0.5
> This severely skews the MLT results (for example, shorter fields will 
> have drastically different term statistics).
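
To make the contrast between the two pseudo-documents concrete, here is a small 
Python sketch. It is only a toy model of the candidate terms available per 
field: the tokenizer is a stand-in for a real analyzer, no scoring is done, and 
the helper names are made up for this illustration rather than taken from Solr 
or Lucene.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (stand-in for an analyzer)."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def pseudo_doc_from_text(text, fields):
    """Scenario (2): the same free text is assigned to every field in mlt.fl."""
    return {field: text for field in fields}


def candidate_terms(doc):
    """Per-field sets of terms that interesting-term selection could pick from."""
    return {field: set(tokenize(value)) for field, value in doc.items()}


fields = ["title", "summary", "abstract"]

# Scenario (1), and what explicit field:text pairs in the request would give us.
explicit_doc = {"title": "foo bar", "summary": "baz bar", "abstract": "ping ping"}

# Scenario (2): everything crammed into a single blob of text.
blob_doc = pseudo_doc_from_text("foo bar, baz bar, ping ping", fields)

explicit_terms = candidate_terms(explicit_doc)
blob_terms = candidate_terms(blob_doc)

for field in fields:
    print(field, sorted(explicit_terms[field]), sorted(blob_terms[field]))
```

With the blob, every field ends up with the same candidate terms, so "foo" can 
be selected for summary even though it only ever appeared in the title.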


