[
https://issues.apache.org/jira/browse/SOLR-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753905#comment-16753905
]
Alessandro Benedetti commented on SOLR-12812:
---------------------------------------------
Hi [~dweiss], the Cloud MLT Query parser does a similar job.
It executes on one Solr instance, but the seed document may live on another
instance (similar to the multi-core use case you mentioned).
It handles this by fetching the seed document with realtime GET and then
passing it to
org.apache.lucene.queries.mlt.MoreLikeThis#like(java.util.Map<java.lang.String,java.util.Collection<java.lang.Object>>).
So I guess this modification should make it possible to call that Lucene
method with the document supplied in the request payload.
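
To illustrate the shape of input that the like(Map<String, Collection<Object>>) overload takes, here is a minimal Python sketch. It is not Solr or Lucene code: the payload format (one "field: text" pair per line) and the helper name parse_field_text_pairs are assumptions made only for illustration.

```python
from collections import defaultdict


def parse_field_text_pairs(payload):
    """Parse 'field: text' lines into a field -> list-of-values mapping.

    The payload format is hypothetical; the result only mirrors the shape
    of Map<String, Collection<Object>>.
    """
    fields = defaultdict(list)
    for line in payload.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, sep, text = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'field: text', got {line!r}")
        fields[name.strip()].append(text.strip())
    return dict(fields)


seed = parse_field_text_pairs(
    "title: foo bar\nsummary: baz bar\nabstract: ping ping"
)
```

Each field keeps its own list of values, so a multi-valued field would simply contribute several entries under the same key.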
> Add support for arbitrary field:text pairs to streaming similarity
> calculation in MoreLikeThisHandler
> -----------------------------------------------------------------------------------------------------
>
> Key: SOLR-12812
> URL: https://issues.apache.org/jira/browse/SOLR-12812
> Project: Solr
> Issue Type: New Feature
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Dawid Weiss
> Priority: Minor
>
> In this issue I would like to add support for a streaming MLT case where the
> content of the request explicitly specifies the field:text pairs to be used
> for the MLT lookup.
> Below is a longer explanation, based on a real use case, of why the current
> solutions do not work.
> Let's say a Solr instance has multiple cores (collections of documents). We'd
> like to search for similar documents between these cores. Let's assume each
> collection of documents has three fields: title, summary and abstract.
> At the moment Solr has two MLT handler options: the query-based (similarity
> to an indexed document) and the free-text based (similarity to an arbitrary
> text).
> 1) The first MLT pipeline in Solr looks for documents similar to the given
> one (I'll assume a single document as input, to keep things simple). This
> pipeline reads the content of the document from the existing index and creates
> a mapping between fields and actual values stored in that document.
> Let's say the document looks like this:
> title: foo bar
> summary: baz bar
> abstract: ping ping
> The "interesting term" extraction routine in MoreLikeThisHelper will extract
> those terms and score them against each field's statistics, then take the
> top-N best-scoring terms (and the fields they're assigned to) and create a
> Boolean query from them. It could go something like this:
> title:foo^1.5 summary:bar^0.5
> When this query is applied against the collection it would *not* match "bar"
> in the title or abstract (because the weighted "important" term wasn't
> selected in that field). That's the way it should be.
> 2) In the second pipeline, we give the full "text" for which we wish to
> obtain similar documents. If we were to emulate scenario (1), we'd have to
> cram the content of each field into a single blob of text, so it'd become
> something like:
> foo bar, baz bar, ping ping
> Solr takes this text and creates a pseudo-document that maps the
> provided set of fields (mlt.fl) to this value. So effectively it
> creates a pseudo-document like this:
> title: foo bar, baz bar, ping ping
> summary: foo bar, baz bar, ping ping
> abstract: foo bar, baz bar, ping ping
> What follows is identical to scenario (1), but note that this time the
> set of terms for each field (and their scores) is much broader. This
> means that the final query can look like this:
> title:foo^1.5 summary:foo^0.5 title:bar^1 summary:bar^0.5
> This results in severely skewed MLT results (for example, shorter fields will
> have drastically different term statistics).
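
The difference between the two scenarios can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it counts raw term frequencies per field, whereas the real MLT code scores terms against index statistics, and the names FIELDS, tokenize, blob_pseudo_document and term_counts are hypothetical.

```python
from collections import Counter
import re

FIELDS = ["title", "summary", "abstract"]  # stands in for mlt.fl


def tokenize(text):
    """Very rough tokenizer: lowercase word characters only."""
    return re.findall(r"\w+", text.lower())


def blob_pseudo_document(text, fields):
    """Scenario (2): every field receives the same blob of text."""
    return {field: text for field in fields}


def term_counts(document):
    """Count candidate terms per field of a field -> text mapping."""
    return {field: Counter(tokenize(text)) for field, text in document.items()}


# Scenario (1) / the proposed input: explicit field:text pairs.
explicit = {"title": "foo bar", "summary": "baz bar", "abstract": "ping ping"}
# Scenario (2): one blob copied into every field.
blob = blob_pseudo_document("foo bar, baz bar, ping ping", FIELDS)

explicit_terms = term_counts(explicit)
blob_terms = term_counts(blob)
```

With explicit pairs, "foo" is a candidate term only for title. With the blob, every field sees every term, which is how clauses such as summary:foo can end up in the final query.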