Re: MoreLikeThisHandler with multiple input documents

Upayavira Wed, 30 Sep 2015 01:24:00 -0700

Could you do the MLT as a separate (AJAX) request? The suggestions would
then appear a little afterwards, while the user is already reading the
page.
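
As a rough sketch of what that separate request might involve (assuming
the /mlt handler, the bandwhu core and the q/fl parameters quoted further
down in this thread; the helper names are hypothetical, and the response
layout with the similar docs under "response" -> "docs" is an assumption
you should check against your own JSON output):

```python
from urllib.parse import urlencode

# Assumed endpoint: the core and handler names are taken from the
# example URL quoted later in this thread.
MLT_BASE = "http://localhost:8983/solr/bandwhu/mlt"


def build_mlt_url(doc_id, base=MLT_BASE):
    """Build the MLT request URL for a single document ID (hypothetical helper)."""
    params = {
        "q": f"id:{doc_id}",  # the source document, looked up by ID
        "fl": "id",           # only the IDs of the similar docs are needed
    }
    return f"{base}?{urlencode(params)}"


def similar_ids(payload):
    """Extract the similar doc IDs from an already parsed JSON response.

    Assumes the similar documents are listed under response -> docs.
    """
    return [doc["id"] for doc in payload["response"]["docs"]]
```

The page itself would render first; a small script would then fetch the
URL for the current product ID and fill in the "similar e-books" box when
the answer arrives.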


Or, you could do offline clustering: overnight, you compare every
document with every other, using a (likely non-Solr) clustering
algorithm, and store the results in a separate core. Then you can
request them immediately after your search query. Or reindex your
content with that data stored alongside.
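
To make the overnight idea concrete, here is a minimal sketch of the
"compare every document with every other" step. It is not a clustering
algorithm and not what Solr's MLT does internally: it is a brute-force
cosine similarity over raw term counts (no tf-idf weighting, no
analysis chain), and the function names and the docs mapping are
hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def top_similar(docs, k=5):
    """For each doc ID, return the IDs of the k most similar other docs.

    docs maps a doc ID to its plain text. Every pair is compared, so the
    cost grows with n*(n-1); that is why this belongs in an offline job.
    """
    vectors = {doc_id: Counter(text.lower().split())
               for doc_id, text in docs.items()}
    result = {}
    for doc_id, vec in vectors.items():
        scored = [(cosine(vec, other), other_id)
                  for other_id, other in vectors.items()
                  if other_id != doc_id]
        # Highest score first; ties are broken by doc ID for stable output.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[doc_id] = [other_id for _, other_id in scored[:k]]
    return result
```

The resulting mapping (doc ID to its top-k neighbours) is what you would
store in the separate core, or index alongside each document, so that
the product page only has to do a cheap lookup.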

Upayavira

On Wed, Sep 30, 2015, at 09:16 AM, Alessandro Benedetti wrote:
> I am still missing why you quote the number of documents...
> If you have 5600 Polish books, but you use MLT only when you land on
> the page of a specific book ...
> I think I am still missing the point!
> MLT on 1 Polish book takes 7 secs?
> 
> 
> 2015-09-30 9:10 GMT+01:00 Szűcs Roland <szucs.rol...@bookandwalk.hu>:
> 
> > Hi Alessandro,
> >
> > You are right. I forgot to mention one important factor. For 3000 Hungarian
> > e-books the approach you mentioned is absolutely fine, as the response time
> > is some 0.7 sec. But when I use the same MLT for 5600 Polish e-books the
> > response time is 7 sec, which is definitely not acceptable for the users.
> >
> > Regards,
> > Roland
> >
> > 2015-09-29 17:19 GMT+02:00 Alessandro Benedetti <benedetti.ale...@gmail.com>:
> >
> > > Hi Roland,
> > > you said "The main goal is that when a customer is on the product page".
> > > But if you are on a product page, I guess you have the product ID.
> > > If you have the product ID, you can simply execute the MLT request with
> > > the single doc ID as input.
> > >
> > > Why do you need to calculate beforehand?
> > >
> > > Cheers
> > >
> > > 2015-09-29 15:44 GMT+01:00 Szűcs Roland <szucs.rol...@bookandwalk.hu>:
> > >
> > > > Hello Upayavira,
> > > >
> > > > The main goal is that when a customer is on the product page of an
> > > > e-book and does not like it for some reason, I want to immediately
> > > > offer him or her alternative e-books on the same topic. If I expect
> > > > the customer to click a button like "similar e-books", I lose half of
> > > > them, as they are too lazy to click anywhere. So I would like to
> > > > present the alternative e-books on the product pages without any
> > > > clicking.
> > > >
> > > > I assumed the best idea was to calculate the similar e-books for each
> > > > e-book against all the others (n*(n-1) similarity calculations) and
> > > > present only the top 5. I planned to do it when our server is not
> > > > busy. At this point I found the description of MLT as a search
> > > > component, which seemed to be a good candidate, as it calculates the
> > > > similar documents for the whole result set of the query. So if I say
> > > > q=*:* and the MLT component is enabled, I get similar documents for my
> > > > entire document set. The only problem with this approach was that the
> > > > MLT search component does not give back the interesting terms for my
> > > > tag cloud calculation.
> > > >
> > > > That's why I tried to mix the flexibility of the MLT component
> > > > (multiple docs accepted as input) with the robustness of
> > > > MoreLikeThisHandler (having interesting terms).
> > > >
> > > > If there is no solution, I will use the MLT component and solve the tag
> > > > cloud calculation some other way. By the way, if I am not mistaken, the
> > > > 5.3.1 version takes the union of the feature sets of the MLT component
> > > > and the handler.
> > > >
> > > > Best Regards,
> > > > Roland
> > > >
> > > >
> > > >
> > > > 2015-09-29 14:38 GMT+02:00 Upayavira <u...@odoko.co.uk>:
> > > >
> > > > > Let's take a step back. So, you have 3000 or so docs, and you want to
> > > > > know which documents are similar to these.
> > > > >
> > > > > Why do you want to know this? What feature do you need to build that
> > > > > will use that information? Knowing this may help us to arrive at the
> > > > > right technology for you.
> > > > >
> > > > > For example, you might want to investigate offline clustering
> > > > > algorithms (e.g. [1], which might be a bit dense to follow). A good
> > > > > book on machine learning, if you are okay with Python, is
> > > > > "Programming Collective Intelligence", as it explains the usual
> > > > > algorithms with simple for loops, making it very clear.
> > > > >
> > > > > Or, you could do searches, and then cluster the results at search
> > > > > time (so if you search for 100 docs, it will identify clusters within
> > > > > those 100 matching documents). That might get you there. See [2].
> > > > >
> > > > > So, if you let us know what the end goal is, perhaps we can suggest
> > > > > an alternative approach, rather than burying ourselves neck-deep in
> > > > > MLT problems.
> > > > >
> > > > > Upayavira
> > > > >
> > > > > [1] http://mylazycoding.blogspot.co.uk/2012/03/cluster-apache-solr-data-using-apache_13.html
> > > > > [2] https://cwiki.apache.org/confluence/display/solr/Result+Clustering
> > > > >
> > > > > On Tue, Sep 29, 2015, at 12:42 PM, Szűcs Roland wrote:
> > > > > > Hello Upayavira,
> > > > > >
> > > > > > Thanks for dealing with my issue. I have already applied
> > > > > > termVectors=true to all fields involved in the More Like This
> > > > > > calculation. I have just 3000 documents, each of them represented
> > > > > > by a relatively big term vector with more than 20,000 unique terms.
> > > > > > If I run the MoreLikeThisHandler for a Solr doc, it takes close to
> > > > > > 1 sec to get back the first 10 similar documents. After this I have
> > > > > > to pass the doc IDs to my other application, which finds the cover
> > > > > > of the e-book and other metadata and puts it on the web. The
> > > > > > end-to-end process takes too much time from the customer's
> > > > > > perspective, which is why I tried to find a solution for an offline
> > > > > > More Like This calculation. But if my app has to call the
> > > > > > MoreLikeThisHandler for each doc, it adds overhead to the offline
> > > > > > calculation.
> > > > > >
> > > > > > Best Regards,
> > > > > > Roland
> > > > > >
> > > > > > 2015-09-29 13:01 GMT+02:00 Upayavira <u...@odoko.co.uk>:
> > > > > >
> > > > > > > If MoreLikeThis is slow for large documents that are indexed, have
> > > > > > > you enabled term vectors on the similarity fields?
> > > > > > >
> > > > > > > Basically, what More Like This does is this:
> > > > > > >
> > > > > > > * decide which terms in the source doc are "interesting", and pick
> > > > > > >   the 25 most interesting ones
> > > > > > > * build and execute a boolean query using these interesting terms.
> > > > > > >
> > > > > > > Looking at the first phase of this in more detail:
> > > > > > >
> > > > > > > If you pass in a document using stream.body, it will analyse this
> > > > > > > document into terms, and then calculate the most interesting terms
> > > > > > > from that.
> > > > > > >
> > > > > > > If you reference a document in your index with a field that is
> > > > > > > stored, it will take the stored version, analyse it, and identify
> > > > > > > the interesting terms from there.
> > > > > > >
> > > > > > > If, however, you have stored term vectors against that field, this
> > > > > > > work is not needed. You have already done much of the work, and the
> > > > > > > identification of your "interesting terms" will be much faster.
> > > > > > >
> > > > > > > Thus, on the content field of your documents, add
> > > > > > > termVectors="true" in your schema, and re-index. Then you could
> > > > > > > well find MLT becoming a lot more efficient.
> > > > > > >
> > > > > > > Upayavira
> > > > > > >
> > > > > > > On Tue, Sep 29, 2015, at 10:39 AM, Szűcs Roland wrote:
> > > > > > > > Hi Alessandro,
> > > > > > > >
> > > > > > > > My original goal was to get offline suggestions based on content
> > > > > > > > similarity for every e-book we have. We wanted to run a bulk More
> > > > > > > > Like This calculation in the evening, when the usage of our site
> > > > > > > > is low and we submit a new e-book. Real-time More Like This can
> > > > > > > > take a while, as we typically have long documents (2-5 MB of
> > > > > > > > text) with all the content indexed.
> > > > > > > >
> > > > > > > > When we upload a new document, we wanted to recalculate the More
> > > > > > > > Like This suggestions and a tf-idf based tag cloud. Both of them
> > > > > > > > are delivered by the MoreLikeThisHandler, but only for one
> > > > > > > > document, as you wrote.
> > > > > > > >
> > > > > > > > The text input is not good for us because we need the similar doc
> > > > > > > > list for each of the matched documents. If I put together the
> > > > > > > > text of 10 documents, I cannot separate which suggestion relates
> > > > > > > > to which matched document, and the tag cloud will also belong to
> > > > > > > > the mixed text.
> > > > > > > >
> > > > > > > > Most likely we will use the MoreLikeThisHandler for each of the
> > > > > > > > documents, parse the JSON response and store the result in a DQL
> > > > > > > > database.
> > > > > > > >
> > > > > > > > Thanks for your help.
> > > > > > > >
> > > > > > > > 2015-09-29 11:18 GMT+02:00 Alessandro Benedetti <benedetti.ale...@gmail.com>:
> > > > > > > >
> > > > > > > > > Hi Roland,
> > > > > > > > > what is your exact requirement?
> > > > > > > > > Do you want to basically build a "description" for a set of
> > > > > > > > > documents and then find documents in the index similar to this
> > > > > > > > > description?
> > > > > > > > >
> > > > > > > > > By default, based on my experience (and on the code), this is
> > > > > > > > > the entry point for the Lucene More Like This:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >     // org.apache.lucene.queries.mlt.MoreLikeThis
> > > > > > > > >
> > > > > > > > >     /**
> > > > > > > > >      * Return a query that will return docs like the passed lucene document ID.
> > > > > > > > >      *
> > > > > > > > >      * @param docNum the documentID of the lucene doc to generate the 'More Like This" query for.
> > > > > > > > >      * @return a query that will return docs like the passed lucene document ID.
> > > > > > > > >      */
> > > > > > > > >     public Query like(int docNum) throws IOException {
> > > > > > > > >       if (fieldNames == null) {
> > > > > > > > >         // gather list of valid fields from lucene
> > > > > > > > >         Collection<String> fields = MultiFields.getIndexedFields(ir);
> > > > > > > > >         fieldNames = fields.toArray(new String[fields.size()]);
> > > > > > > > >       }
> > > > > > > > >       return createQuery(retrieveTerms(docNum));
> > > > > > > > >     }
> > > > > > > > >
> > > > > > > > > It means that, when talking about "documents", you can feed in
> > > > > > > > > only one Solr doc.
> > > > > > > > >
> > > > > > > > > But you can also feed the MLT with simple text.
> > > > > > > > >
> > > > > > > > > So you should study your use case more closely and understand
> > > > > > > > > which option fits better:
> > > > > > > > >
> > > > > > > > > 1) customising the MLT component, starting from Lucene
> > > > > > > > >
> > > > > > > > > 2) doing some processing on the client side and using the "text"
> > > > > > > > > similarity feature.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Cheers
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > 2015-09-29 10:05 GMT+01:00 Roland Szűcs <roland.sz...@bookandwalk.com>:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > Is it possible to feed multiple Solr IDs to a MoreLikeThisHandler?
> > > > > > > > > >
> > > > > > > > > > <requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
> > > > > > > > > >   <lst name="defaults">
> > > > > > > > > >     <str name="mlt.match.include">false</str>
> > > > > > > > > >     <str name="mlt.interestingTerms">details</str>
> > > > > > > > > >     <str name="mlt.fl">title,content</str>
> > > > > > > > > >     <str name="mlt.minwl">4</str>
> > > > > > > > > >     <str name="mlt.qf">title^12 content^1</str>
> > > > > > > > > >     <str name="mlt.mintf">2</str>
> > > > > > > > > >     <int name="mlt.count">10</int>
> > > > > > > > > >     <str name="mlt.boost">true</str>
> > > > > > > > > >     <str name="wt">json</str>
> > > > > > > > > >     <str name="indent">true</str>
> > > > > > > > > >   </lst>
> > > > > > > > > > </requestHandler>
> > > > > > > > > >
> > > > > > > > > > When I call this:
> > > > > > > > > > http://localhost:8983/solr/bandwhu/mlt?q=id:8&fl=id
> > > > > > > > > > it works fine. Is there any way to make a kind of "bulk" call
> > > > > > > > > > to the More Like This handler? I need the interesting terms as
> > > > > > > > > > well, and as far as I know, if I use More Like This as a search
> > > > > > > > > > component it does not return them, so it is not an alternative.
> > > > > > > > > >
> > > > > > > > > > Thanks in advance,
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Roland Szűcs <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> > > > > > > > > > Connect with me on LinkedIn <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> > > > > > > > > > <https://bookandwalk.hu/>
> > > > > > > > > > CEO
> > > > > > > > > > Phone: +36 1 210 81 13
> > > > > > > > > > Bookandwalk.hu <https://bokandwalk.hu/>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > --------------------------
> > > > > > > > >
> > > > > > > > > Benedetti Alessandro
> > > > > > > > > Visiting card - http://about.me/alessandro_benedetti
> > > > > > > > > Blog - http://alexbenedetti.blogspot.co.uk
> > > > > > > > >
> > > > > > > > > "Tyger, tyger burning bright
> > > > > > > > > In the forests of the night,
> > > > > > > > > What immortal hand or eye
> > > > > > > > > Could frame thy fearful symmetry?"
> > > > > > > > >
> > > > > > > > > William Blake - Songs of Experience -1794 England
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> >
> >
> >
> >
> 
> 
> 
