If MoreLikeThis is slow on large indexed documents, have you enabled
term vectors on the similarity fields?

Basically, MoreLikeThis works in two phases:

* decide on what terms in the source doc are "interesting", and pick the
25 most interesting ones
* build and execute a boolean query using these interesting terms.
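
The two phases above can be sketched roughly as follows. This is a simplified illustration, not Lucene's actual implementation: the function names, the `tf * idf` formula and the `OR` query string are assumptions made for the example, and the thresholds simply mirror the `mlt.mintf`, `mlt.minwl` and 25-term limit mentioned in this thread.

```python
import math
from collections import Counter


def interesting_terms(doc_tokens, doc_freq, num_docs,
                      max_terms=25, min_tf=2, min_word_len=4):
    """Rank a document's terms by a simple tf*idf score and keep the best ones.

    doc_tokens: analysed terms of the source document
    doc_freq:   hypothetical mapping of term -> number of docs containing it
    num_docs:   total number of documents in the index
    """
    # Term frequencies, ignoring words shorter than min_word_len.
    tf = Counter(t for t in doc_tokens if len(t) >= min_word_len)
    scored = []
    for term, freq in tf.items():
        if freq < min_tf:
            continue  # too rare in this document to be "interesting"
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # not in the index, so it cannot match anything
        idf = math.log(num_docs / df) + 1.0
        scored.append((freq * idf, term))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(term, score) for score, term in scored[:max_terms]]


def boolean_query(field, terms):
    """Join the selected terms into a boosted OR query string."""
    return " OR ".join(f"{field}:{term}^{score:.2f}" for term, score in terms)
```

The expensive part is producing `doc_tokens` and their frequencies for a multi-megabyte document, which is the work discussed next.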

Looking at the first phase of this in more detail:

If you pass in a document using stream.body, it will analyse this
document into terms, and then calculate the most interesting terms from
that.

If you reference a document in your index with a field that is stored,
it will take the stored value, analyse it, and identify the interesting
terms from there.

If, however, you have stored term vectors against that field, this work
is not needed. You have already done much of the work, and the
identification of your "interesting terms" will be much faster.

Thus, on the content field of your documents, add termVectors="true" in
your schema, and re-index. Then you could well find MLT becoming a lot
more efficient.
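
For the bulk case discussed in the quoted messages below, where the handler takes one document per request, a nightly run amounts to a loop over ids. The following is a minimal sketch under stated assumptions: the core URL and the `q`/`fl` parameters come from the example request in the thread, the helper names are hypothetical, and the response layout (a `response.docs` list plus `interestingTerms` as a flat `[term, score, ...]` list) is what I would expect with `mlt.interestingTerms=details` and JSON output, so check it against a real response before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Core URL taken from the example request in the quoted thread.
SOLR_MLT = "http://localhost:8983/solr/bandwhu/mlt"


def mlt_url(doc_id, base=SOLR_MLT):
    """Build the /mlt request URL for a single document id."""
    params = {"q": f"id:{doc_id}", "fl": "id", "wt": "json"}
    return f"{base}?{urlencode(params)}"


def parse_mlt(payload):
    """Extract similar doc ids and interesting terms from one decoded response.

    Assumes interestingTerms is a flat [term, score, term, score, ...] list.
    """
    similar = [d["id"] for d in payload.get("response", {}).get("docs", [])]
    flat = payload.get("interestingTerms", [])
    terms = dict(zip(flat[0::2], flat[1::2]))
    return similar, terms


def bulk_mlt(doc_ids, fetch=None):
    """Call the handler once per id and collect the results keyed by id."""
    if fetch is None:
        def fetch(url):
            # Default: a plain HTTP GET, decoding the JSON body.
            with urlopen(url) as resp:
                return json.load(resp)
    return {doc_id: parse_mlt(fetch(mlt_url(doc_id))) for doc_id in doc_ids}
```

Because each id keeps its own result, the suggestions and the tag-cloud terms stay separated per document and can be written to whatever store you use. The `fetch` argument exists only so the loop can be exercised without a running Solr.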

Upayavira

On Tue, Sep 29, 2015, at 10:39 AM, Szűcs Roland wrote:
> Hi Alessandro,
> 
> My original goal was to get offline suggestions based on content
> similarity for every e-book we have. We wanted to run a bulk more like
> this calculation in the evening, when the usage of our site is low and
> we submit a new e-book. Real time more like this can take a while, as we
> typically have long documents (2-5MB text) with all the content indexed.
> 
> When we upload a new document we want to recalculate the more like this
> suggestions and the tf-idf based tag clouds. Both of them are delivered by
> the MoreLikeThisHandler, but only for one document, as you wrote.
> 
> The text input is not good for us because we need the similar doc list
> for each of the matched documents. If I put together the text of 10
> documents, I cannot separate which suggestion relates to which matched
> document, and the tag cloud will also belong to the mixed text.
> 
> Most likely we will use the MoreLikeThisHandler for each of the documents,
> parse the JSON response, and store the result in a DQL database.
> 
> Thanks for your help.
> 
> 2015-09-29 11:18 GMT+02:00 Alessandro Benedetti <benedetti.ale...@gmail.com>:
> 
> > Hi Roland,
> > what is your exact requirement?
> > Do you basically want to build a "description" for a set of documents and
> > then find documents in the index that are similar to this description?
> >
> > By default, based on my experience (and on the code), this is the entry
> > point for the Lucene More Like This:
> >
> >
> > > org.apache.lucene.queries.mlt.MoreLikeThis
> > >
> > > /**
> > >  * Return a query that will return docs like the passed lucene document ID.
> > >  *
> > >  * @param docNum the documentID of the lucene doc to generate the 'More Like This" query for.
> > >  * @return a query that will return docs like the passed lucene document ID.
> > >  */
> > > public Query like(int docNum) throws IOException {
> > >   if (fieldNames == null) {
> > >     // gather list of valid fields from lucene
> > >     Collection<String> fields = MultiFields.getIndexedFields(ir);
> > >     fieldNames = fields.toArray(new String[fields.size()]);
> > >   }
> > >   return createQuery(retrieveTerms(docNum));
> > > }
> >
> > It means that, as far as "documents" go, you can feed it only one Solr doc.
> >
> > But you can also feed the MLT with simple text.
> >
> > So you should study your use case more closely and work out which option
> > fits better:
> >
> > 1) customising the MLT component starting from Lucene
> >
> > 2) doing some processing client side and use the "text" similarity feature.
> >
> >
> > Cheers
> >
> >
> > 2015-09-29 10:05 GMT+01:00 Roland Szűcs <roland.sz...@bookandwalk.com>:
> >
> > > Hi all,
> > >
> > > Is it possible to feed multiple Solr ids to a MoreLikeThisHandler?
> > >
> > > <requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
> > >   <lst name="defaults">
> > >     <str name="mlt.match.include">false</str>
> > >     <str name="mlt.interestingTerms">details</str>
> > >     <str name="mlt.fl">title,content</str>
> > >     <str name="mlt.minwl">4</str>
> > >     <str name="mlt.qf">title^12 content^1</str>
> > >     <str name="mlt.mintf">2</str>
> > >     <int name="mlt.count">10</int>
> > >     <str name="mlt.boost">true</str>
> > >     <str name="wt">json</str>
> > >     <str name="indent">true</str>
> > >   </lst>
> > > </requestHandler>
> > >
> > > When I call http://localhost:8983/solr/bandwhu/mlt?q=id:8&fl=id
> > > it works fine. Is there any way to make a kind of "bulk" call to the
> > > more like this handler? I need the interesting terms as well, and as far
> > > as I know, if I use more like this as a search component it does not
> > > return them, so that is not an alternative.
> > >
> > > Thanks in advance,
> > >
> > >
> > > --
> > > Roland Szűcs
> > > Connect with me on Linkedin <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> > > <https://bookandwalk.hu/> CEO, Phone: +36 1 210 81 13
> > > Bookandwalk.hu <https://bokandwalk.hu/>
> > >
> >
> >
> >
> > --
> > --------------------------
> >
> > Benedetti Alessandro
> > Visiting card - http://about.me/alessandro_benedetti
> > Blog - http://alexbenedetti.blogspot.co.uk
> >
> > "Tyger, tyger burning bright
> > In the forests of the night,
> > What immortal hand or eye
> > Could frame thy fearful symmetry?"
> >
> > William Blake - Songs of Experience -1794 England
> >
> 
> 
> 
> -- 
> Szűcs Roland
> Connect with me on Linkedin <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> <https://bookandwalk.hu/> CEO, Phone: +36 1 210 81 13
> Bookandwalk.hu <https://bokandwalk.hu/>
