Re: MoreLikeThisHandler with multiple input documents

Szűcs Roland Wed, 30 Sep 2015 01:41:15 -0700

Hello Upayavira,

We use the AJAX call, and it can work when it takes only a few seconds (even
the 7 sec can be acceptable in this case), as customers first focus on the
product page, and only if they are not satisfied with the e-book will they
need the offer. I am just starting to worry about what will happen if we
move to the market of English e-books with 1 million titles. I will try
clustering as well, or, using the term vector component, we can implement
our own more like this calculation, as we realized that sometimes fewer than
25 interesting terms are enough to make a good recommendation, and that can
make the calculation faster. If you look at my previous email with the
interesting terms, it clearly shows that half of the terms, or even fewer,
would be enough. What a pity that there is no such parameter for the more
like this handler: an mlt.interestingtermcount which would be set to 25 by
default, but which we could modify in solrconfig to make the calculation
less resource intensive.
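
A note on the parameter wished for above: the existing mlt.maxqt parameter
(maximum number of query terms in the generated query, 25 by default) may
already cover this. A minimal sketch of building such a request in Python,
assuming the /mlt handler and the bandwhu core quoted further down in this
thread; the helper name build_mlt_url and the value 12 are only
illustrative, and whether a lower mlt.maxqt really speeds things up on a
given index is untested here:

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, doc_id, max_query_terms=25, rows=5):
    """Build a MoreLikeThisHandler URL for one source document.

    base_url is the handler endpoint (assumed here to be .../solr/bandwhu/mlt).
    max_query_terms is sent as mlt.maxqt, which caps the number of
    interesting terms used in the generated query.
    """
    params = {
        "q": f"id:{doc_id}",            # the single source document
        "fl": "id",                     # only return ids of similar docs
        "rows": rows,                   # how many similar docs to return
        "mlt.fl": "title,content",      # fields used for similarity
        "mlt.maxqt": max_query_terms,   # fewer terms, smaller boolean query
        "mlt.interestingTerms": "details",
        "wt": "json",
    }
    return f"{base_url}?{urlencode(params)}"


url = build_mlt_url("http://localhost:8983/solr/bandwhu/mlt", 8, max_query_terms=12)
print(url)
```

The same value could instead be placed in the handler defaults in
solrconfig, next to the other mlt.* settings, so that clients do not have to
send it with every request.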


Thank you, Upayavira and Alessandro, for all the help and effort. I see the
options much more clearly now.

Cheers,
Roland

2015-09-30 10:23 GMT+02:00 Upayavira <u...@odoko.co.uk>:

> Could you do the MLT as a separate (AJAX) request? The results would then
> appear a little afterwards, whilst the user is already reading the page.
>
> Or, you could do offline clustering, in which case, overnight, you
> compare every document with every other, using a (likely non-solr)
> clustering algorithm, and store those in a separate core. Then you can
> request those immediately after your search query. Or reindex your
> content with that data stored alongside.
>
> Upayavira
>
> On Wed, Sep 30, 2015, at 09:16 AM, Alessandro Benedetti wrote:
> > I still don't see why you quote the number of documents...
> > If you have 5600 Polish books, but you use the MLT only when you land on
> > the page of a specific book ...
> > I think I am still missing the point!
> > Does MLT on 1 Polish book take 7 secs?
> >
> >
> > 2015-09-30 9:10 GMT+01:00 Szűcs Roland <szucs.rol...@bookandwalk.hu>:
> >
> > > Hi Alessandro,
> > >
> > > You are right. I forgot to mention one important factor. For 3000
> > > Hungarian e-books the approach you mentioned is absolutely fine, as the
> > > response time is some 0.7 sec. But when I use the same mlt for 5600
> > > Polish e-books the response time is 7 sec, which is definitely not
> > > acceptable for the users.
> > >
> > > Regards,
> > > Roland
> > >
> > > 2015-09-29 17:19 GMT+02:00 Alessandro Benedetti <benedetti.ale...@gmail.com>:
> > >
> > > > Hi Roland,
> > > > you said "The main goal is that when a customer is on the product page".
> > > > But if you are on a product page, I guess you have the product ID.
> > > > If you have the product ID, you can simply execute the MLT request
> > > > with the single doc ID as input.
> > > >
> > > > Why do you need to calculate beforehand?
> > > >
> > > > Cheers
> > > >
> > > > 2015-09-29 15:44 GMT+01:00 Szűcs Roland <szucs.rol...@bookandwalk.hu>:
> > > >
> > > > > Hello Upayavira,
> > > > >
> > > > > The main goal is that when a customer is on the product page of an
> > > > > e-book and does not like it for some reason, I want to immediately
> > > > > offer her/him alternative e-books on the same topic. If I expect the
> > > > > customer to click on a button like "similar e-books", I lose half of
> > > > > them, as they are too lazy to click anywhere. So I would like to
> > > > > present the alternative e-books on the product pages without any
> > > > > clicking.
> > > > >
> > > > > I assumed the best idea was to calculate the similar e-books for each
> > > > > e-book against all the others (n*(n-1) similarity calculations) and
> > > > > present only the top 5. I planned to do it when our server is not
> > > > > busy. At this point I found the description of mlt as a search
> > > > > component, which seemed to be a good candidate, as it calculates the
> > > > > similar documents for the whole result set of the query. So if I say
> > > > > q=*:* and the mlt component is enabled, I get similar documents for my
> > > > > entire document set. The only problem with this approach was that the
> > > > > mlt search component does not give back the interesting terms for my
> > > > > tag cloud calculation.
> > > > >
> > > > > That's why I tried to mix the flexibility of the mlt component
> > > > > (multiple docs accepted as input) with the robustness of the
> > > > > MoreLikeThisHandler (having interesting terms).
> > > > >
> > > > > If there is no solution, I will use the mlt component and solve the
> > > > > tag cloud calculation another way. By the way, if I am not mistaken,
> > > > > the 5.3.1 version takes the union of the feature sets of the mlt
> > > > > component and handler.
> > > > >
> > > > > Best Regards,
> > > > > Roland
> > > > >
> > > > >
> > > > >
> > > > > 2015-09-29 14:38 GMT+02:00 Upayavira <u...@odoko.co.uk>:
> > > > >
> > > > > > Let's take a step back. So, you have 3000 or so docs, and you want to
> > > > > > know which documents are similar to these.
> > > > > >
> > > > > > Why do you want to know this? What feature do you need to build that
> > > > > > will use that information? Knowing this may help us to arrive at the
> > > > > > right technology for you.
> > > > > >
> > > > > > For example, you might want to investigate offline clustering
> > > > > > algorithms (e.g. [1], which might be a bit dense to follow). A good
> > > > > > book on machine learning, if you are okay with Python, is "Programming
> > > > > > Collective Intelligence", as it explains the usual algorithms with
> > > > > > simple for loops, making it very clear.
> > > > > >
> > > > > > Or, you could do searches, and then cluster the results at search time
> > > > > > (so if you search for 100 docs, it will identify clusters within those
> > > > > > 100 matching documents). That might get you there. See [2].
> > > > > >
> > > > > > So, if you let us know what the end-goal is, perhaps we can suggest an
> > > > > > alternative approach, rather than burying ourselves neck-deep in MLT
> > > > > > problems.
> > > > > >
> > > > > > Upayavira
> > > > > >
> > > > > > [1] http://mylazycoding.blogspot.co.uk/2012/03/cluster-apache-solr-data-using-apache_13.html
> > > > > > [2] https://cwiki.apache.org/confluence/display/solr/Result+Clustering
> > > > > >
> > > > > > On Tue, Sep 29, 2015, at 12:42 PM, Szűcs Roland wrote:
> > > > > > > Hello Upayavira,
> > > > > > >
> > > > > > > Thanks for dealing with my issue. I have already applied
> > > > > > > termVectors=true to all fields involved in the more like this
> > > > > > > calculation. I have just 3 000 documents, each of them represented
> > > > > > > by a relatively big term vector with more than 20 000 unique terms.
> > > > > > > If I run the more like this handler for a solr doc, it takes close
> > > > > > > to 1 sec to get back the first 10 similar documents. After this I
> > > > > > > have to pass the doc IDs to my other application, which finds the
> > > > > > > cover of the e-book and other metadata and puts it on the web. The
> > > > > > > end-to-end process takes too much time from the customer's
> > > > > > > perspective, which is why I tried to find a solution for offline
> > > > > > > more like this calculation. But if my app has to call the
> > > > > > > MoreLikeThisHandler for each doc, that adds overhead to the offline
> > > > > > > calculation.
> > > > > > >
> > > > > > > Best Regards,
> > > > > > > Roland
> > > > > > >
> > > > > > > 2015-09-29 13:01 GMT+02:00 Upayavira <u...@odoko.co.uk>:
> > > > > > >
> > > > > > > > If MoreLikeThis is slow for large documents that are indexed, have you
> > > > > > > > enabled term vectors on the similarity fields?
> > > > > > > >
> > > > > > > > Basically, what more like this does is this:
> > > > > > > >
> > > > > > > > * decide which terms in the source doc are "interesting", and pick the
> > > > > > > >   25 most interesting ones
> > > > > > > > * build and execute a boolean query using these interesting terms.
> > > > > > > >
> > > > > > > > Looking at the first phase of this in more detail:
> > > > > > > >
> > > > > > > > If you pass in a document using stream.body, it will analyse this
> > > > > > > > document into terms, and then calculate the most interesting terms from
> > > > > > > > that.
> > > > > > > >
> > > > > > > > If you reference a document in your index with a field that is stored,
> > > > > > > > it will take the stored version, analyse it, and identify the
> > > > > > > > interesting terms from there.
> > > > > > > >
> > > > > > > > If, however, you have stored term vectors against that field, this work
> > > > > > > > is not needed. You have already done much of the work, and the
> > > > > > > > identification of your "interesting terms" will be much faster.
> > > > > > > >
> > > > > > > > Thus, on the content field of your documents, add termVectors="true" in
> > > > > > > > your schema, and re-index. Then you could well find MLT becoming a lot
> > > > > > > > more efficient.
> > > > > > > >
> > > > > > > > Upayavira
> > > > > > > >
> > > > > > > > On Tue, Sep 29, 2015, at 10:39 AM, Szűcs Roland wrote:
> > > > > > > > > Hi Alessandro,
> > > > > > > > >
> > > > > > > > > My original goal was to get offline suggestions based on content
> > > > > > > > > similarity for every e-book we have. We wanted to run a bulk more
> > > > > > > > > like this calculation in the evening, when the usage of our site is
> > > > > > > > > low and we submit a new e-book. Real time more like this can take a
> > > > > > > > > while, as we typically have long documents (2-5 MB of text) with all
> > > > > > > > > the content indexed.
> > > > > > > > >
> > > > > > > > > When we upload a new document we wanted to recalculate the more like
> > > > > > > > > this suggestions and the tf-idf based tag clouds. Both of them are
> > > > > > > > > delivered by the MoreLikeThisHandler, but only for one document, as
> > > > > > > > > you wrote.
> > > > > > > > >
> > > > > > > > > The text input is not good for us because we need the similar doc
> > > > > > > > > list for each of the matched documents. If I put together the text of
> > > > > > > > > 10 documents, I cannot tell which suggestion relates to which matched
> > > > > > > > > document, and the tag cloud will also belong to the mixed text.
> > > > > > > > >
> > > > > > > > > Most likely we will use the MoreLikeThisHandler for each of the
> > > > > > > > > documents, parse the JSON response, and store the result in a DQL
> > > > > > > > > database.
> > > > > > > > >
> > > > > > > > > Thanks for your help.
> > > > > > > > >
> > > > > > > > > 2015-09-29 11:18 GMT+02:00 Alessandro Benedetti <benedetti.ale...@gmail.com>:
> > > > > > > > >
> > > > > > > > > > Hi Roland,
> > > > > > > > > > what is your exact requirement?
> > > > > > > > > > Do you basically want to build a "description" for a set of documents
> > > > > > > > > > and then find documents in the index similar to this description?
> > > > > > > > > >
> > > > > > > > > > By default, based on my experience (and on the code), this is the
> > > > > > > > > > entry point for the Lucene More Like This:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > org.apache.lucene.queries.mlt.MoreLikeThis
> > > > > > > > > > >
> > > > > > > > > > > /**
> > > > > > > > > > >  * Return a query that will return docs like the passed lucene document ID.
> > > > > > > > > > >  *
> > > > > > > > > > >  * @param docNum the documentID of the lucene doc to generate the 'More Like This" query for.
> > > > > > > > > > >  * @return a query that will return docs like the passed lucene document ID.
> > > > > > > > > > >  */
> > > > > > > > > > > public Query like(int docNum) throws IOException {
> > > > > > > > > > >   if (fieldNames == null) {
> > > > > > > > > > >     // gather list of valid fields from lucene
> > > > > > > > > > >     Collection<String> fields = MultiFields.getIndexedFields(ir);
> > > > > > > > > > >     fieldNames = fields.toArray(new String[fields.size()]);
> > > > > > > > > > >   }
> > > > > > > > > > >   return createQuery(retrieveTerms(docNum));
> > > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > It means that, talking about "documents", you can feed only one Solr
> > > > > > > > > > doc.
> > > > > > > > > >
> > > > > > > > > > But you can also feed the MLT with simple text.
> > > > > > > > > >
> > > > > > > > > > So you should study your use case further and understand which option
> > > > > > > > > > fits better:
> > > > > > > > > >
> > > > > > > > > > 1) customising the MLT component starting from Lucene
> > > > > > > > > >
> > > > > > > > > > 2) doing some processing client side and using the "text" similarity
> > > > > > > > > > feature.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Cheers
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > 2015-09-29 10:05 GMT+01:00 Roland Szűcs <roland.sz...@bookandwalk.com>:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > Is it possible to feed multiple Solr IDs to a MoreLikeThisHandler?
> > > > > > > > > > >
> > > > > > > > > > > <requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
> > > > > > > > > > >   <lst name="defaults">
> > > > > > > > > > >     <str name="mlt.match.include">false</str>
> > > > > > > > > > >     <str name="mlt.interestingTerms">details</str>
> > > > > > > > > > >     <str name="mlt.fl">title,content</str>
> > > > > > > > > > >     <str name="mlt.minwl">4</str>
> > > > > > > > > > >     <str name="mlt.qf">title^12 content^1</str>
> > > > > > > > > > >     <str name="mlt.mintf">2</str>
> > > > > > > > > > >     <int name="mlt.count">10</int>
> > > > > > > > > > >     <str name="mlt.boost">true</str>
> > > > > > > > > > >     <str name="wt">json</str>
> > > > > > > > > > >     <str name="indent">true</str>
> > > > > > > > > > >   </lst>
> > > > > > > > > > > </requestHandler>
> > > > > > > > > > >
> > > > > > > > > > > When I call this: http://localhost:8983/solr/bandwhu/mlt?q=id:8&fl=id
> > > > > > > > > > > it works fine. Is there any way to have a kind of "bulk" call of the
> > > > > > > > > > > more like this handler? I need the interesting terms as well, and as
> > > > > > > > > > > far as I know, if I use more like this as a search component it does
> > > > > > > > > > > not return them, so it is not an alternative.
> > > > > > > > > > >
> > > > > > > > > > > Thanks in advance,
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Roland Szűcs <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> > > > > > > > > > > Connect with me on Linkedin <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> > > > > > > > > > > <https://bookandwalk.hu/>
> > > > > > > > > > > CEO
> > > > > > > > > > > Phone: +36 1 210 81 13
> > > > > > > > > > > Bookandwalk.hu <https://bokandwalk.hu/>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > --------------------------
> > > > > > > > > >
> > > > > > > > > > Benedetti Alessandro
> > > > > > > > > > Visiting card - http://about.me/alessandro_benedetti
> > > > > > > > > > Blog - http://alexbenedetti.blogspot.co.uk
> > > > > > > > > >
> > > > > > > > > > "Tyger, tyger burning bright
> > > > > > > > > > In the forests of the night,
> > > > > > > > > > What immortal hand or eye
> > > > > > > > > > Could frame thy fearful symmetry?"
> > > > > > > > > >
> > > > > > > > > > William Blake - Songs of Experience -1794 England
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> >
> >
> >
>



-- 
Szűcs Roland <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
Connect with me on LinkedIn <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
<https://bookandwalk.hu/>
CEO
Phone: +36 1 210 81 13
Bookandwalk.hu <https://bokandwalk.hu/>
