[
https://issues.apache.org/jira/browse/SOLR-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753928#comment-16753928
]
Dawid Weiss commented on SOLR-12812:
------------------------------------
The only difference is that with Solr you know the field-value mapping, whereas
here you need to provide it within the request (because the values are external
to the index). I know it can be done in Lucene, but the corresponding code needs
to be added to the MLT handler to extract the field-value mapping first.
> Add support for arbitrary field:text pairs to streaming similarity
> calculation in MoreLikeThisHandler
> -----------------------------------------------------------------------------------------------------
>
> Key: SOLR-12812
> URL: https://issues.apache.org/jira/browse/SOLR-12812
> Project: Solr
> Issue Type: New Feature
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Dawid Weiss
> Priority: Minor
>
> In this issue I would like to add support for the streaming MLT case where
> the content of the request explicitly specifies the field:text pairs to be
> used for MLT lookup.
> Below is a longer explanation, based on a real use case, of why the current
> solutions do not work.
> Let's say a Solr instance has multiple cores (collections of documents). We'd
> like to search for similar documents between these cores. Let's assume each
> collection of documents has three fields: title, summary and abstract.
> At the moment Solr has two MLT handler options: the query-based (similarity
> to an indexed document) and the free-text based (similarity to an arbitrary
> text).
> 1) The first MLT pipeline in Solr looks for documents similar to the given
> one (I'll assume a single document as input, to keep things simple). This
> pipeline reads the content of the document from the existing index and creates
> a mapping between fields and actual values stored in that document.
> Let's say the document looks like this:
> title: foo bar
> summary: baz bar
> abstract: ping ping
> The "interesting term" extraction routine in MoreLikeThisHelper will extract
> those terms and score them against each field's statistics, then take the
> top-N best scoring terms (and the fields they're assigned to) and create a
> Boolean query from them. It could go something like this:
> title:foo^1.5 summary:bar^0.5
> When this query is applied against the collection it would *not* match "bar"
> in the title or abstract (because the weighted "important" term wasn't
> selected in that field). That's the way it should be.
> 2) In the second pipeline, we give the full "text" for which we wish to
> obtain similar documents. If we were to emulate scenario (1), we'd have to
> cram the content of each field into a single blob of text, so it'd become
> something like:
> foo bar, baz bar, ping ping
> Solr takes this text and creates a pseudo-document that maps the
> provided set of fields (mlt.fl) to this value. So effectively it
> creates a pseudo-document like this:
> title: foo bar, baz bar, ping ping
> summary: foo bar, baz bar, ping ping
> abstract: foo bar, baz bar, ping ping
> What follows is identical to scenario (1), but note that this time the
> set of terms for each field (and their scores) is much broader. This
> means that the final query can look like this:
> title:foo^1.5 summary:foo^0.5 title:bar^1 summary:bar^0.5
> This results in severely skewed MLT results (for example, shorter fields will
> have drastically different term statistics).
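The contrast between the two scenarios above can be sketched in a few lines of plain Python. This is only a toy model, not Solr or Lucene code: the tokenizer, the helper names (tokenize, pseudo_document_from_blob, candidate_terms) and the idea of comparing raw (field, term) candidate sets are assumptions made for illustration, and no scoring against index statistics is attempted.

```python
import re

FIELDS = ["title", "summary", "abstract"]  # plays the role of mlt.fl


def tokenize(text):
    """Lowercase and split on non-word characters (a stand-in for a real analyzer)."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def pseudo_document_from_blob(text, fields):
    """Free-text case: every requested field is mapped to the same blob of text."""
    return {field: text for field in fields}


def candidate_terms(document):
    """Collect the (field, term) pairs that term selection could choose from."""
    return {
        (field, term)
        for field, value in document.items()
        for term in tokenize(value)
    }


# Scenario (1): the field-value mapping is known.
explicit = {"title": "foo bar", "summary": "baz bar", "abstract": "ping ping"}

# Scenario (2): one blob of text copied into every field.
blob = pseudo_document_from_blob("foo bar, baz bar, ping ping", FIELDS)

explicit_candidates = candidate_terms(explicit)
blob_candidates = candidate_terms(blob)

# Pairs that exist only because the blob was copied into every field.
spurious = sorted(blob_candidates - explicit_candidates)

print(len(explicit_candidates), "candidates with explicit field:text pairs")
print(len(blob_candidates), "candidates with a single text blob")
for field, term in spurious:
    print(f"spurious candidate: {field}:{term}")
```

In this toy model, pairs such as title:baz or summary:ping become candidates only in the blob case, which is the broadening described in scenario (2). Supplying the field:text pairs in the request, as the issue proposes, would correspond to the `explicit` mapping.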