Thanks for the suggestion, Anshum, I appreciate your response! I tried using MLT on the field that stores the similarity index of the topics a given topic could be related to. It wasn't accepted as the solution, though, because it does not resolve the next stage of my problem: for each related topic deduced by MLT, I also need the effective 'number of posts' in which the two topics were found together. I believe MLT uses such numbers internally to order the returned set. So the major challenge is to get those numbers as well, since they are plotted on a graph.
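To make the numbers I'm after concrete, here is a rough sketch in plain Python over in-memory data. It is not how our index works today: the field names `published` and `topics`, the sample posts, and the helper `related_topic_counts` are all made up for illustration, and it assumes one record per blogpost carrying the set of topics extracted from it.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical sample data: one record per blogpost with its extracted topics.
# Field names and values are illustrative only, not our real schema.
POSTS = [
    {"published": date(2017, 10, 20), "topics": {"football", "fifa", "ronaldo"}},
    {"published": date(2017, 10, 15), "topics": {"football", "fifa"}},
    {"published": date(2017, 9, 10), "topics": {"football", "nfl"}},
    {"published": date(2017, 9, 5), "topics": {"football", "nfl"}},
    {"published": date(2017, 9, 1), "topics": {"football", "nfl"}},
    {"published": date(2017, 10, 1), "topics": {"cricket", "ipl"}},
]


def related_topic_counts(posts, topic, days, today):
    """Return (related_topic, number_of_posts) pairs, most frequent first.

    Only posts published within the last `days` days (relative to `today`)
    that mention `topic` are considered.
    """
    cutoff = today - timedelta(days=days)
    counts = Counter()
    for post in posts:
        # Skip posts outside the user-defined date spread.
        if not (cutoff <= post["published"] <= today):
            continue
        # Skip posts that do not mention the topic we started from.
        if topic not in post["topics"]:
            continue
        # Every other topic on this post co-occurred with `topic` once more.
        counts.update(post["topics"] - {topic})
    return counts.most_common()


if __name__ == "__main__":
    today = date(2017, 10, 27)
    for days in (30, 60):
        print(days, related_topic_counts(POSTS, "football", days, today))
```

With this sample data, "fifa" is the top related topic for "football" over the last 30 days, while "nfl" overtakes it over the last 60 days. That is exactly the date-dependent behaviour I described, and the count next to each topic is the 'number of posts' I need for the graph.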
I wonder if there's an alternative way to get those numbers. I'd appreciate any further input on this.

Thanks,
Atita

On Thu, Oct 26, 2017 at 11:36 PM, Anshum Gupta <ansh...@apple.com> wrote:

> I would suggest you look at the mlt query parser. That allows you to find
> documents similar to a particular document, and also allows for specifying
> the field to use for similarity purposes.
>
> https://lucene.apache.org/solr/guide/7_0/other-parsers.html#more-like-this-query-parser
>
> -Anshum
>
> On Oct 26, 2017, at 1:16 AM, Atita Arora <atitaar...@gmail.com> wrote:
>
> Hi,
>
> We're working on a product where the idea is to present users with the
> related documents in a particular time series.
>
> For an overview, think of this as an application which picks up the top
> trending blogpost "topics", which are picked and ingested from various
> social sites.
> Further, when you look into a topic from the trending list, it shows the
> related topics which happen to appear on the same blogposts.
> So to be marked as related, two topics should have occurred on the same
> blogpost, and the more such co-occurrences there are, the higher the
> relatedness factor.
>
> The complexity is that the related topics change with the user-defined date
> spread, which means that if x & y were the most related topics in the
> blogposts made in the last 30 days, it is equally possible that x is more
> related to z if the user wants to see related topics for the last 60 days.
> So the number of days is user defined, and it impacts the related topics.
>
> For now every blogpost goes into the index as a separate document, and the
> topic extraction happens alongside indexing: it extracts the topics from
> the blogposts and stores them in a different collection.
> Because of this we have a lot of duplicates in the index too; e.g. a
> topicname search for "football" returns around 80K documents, all of them
> with topicname="football".
>
> I wonder if someone can help me with:
> 1. How to structure the documents so that the queries are more performant.
> 2. How we can detect the RELATED topics.
>
> Any help on this would be highly appreciated.
>
> Thanks in advance.
>
> Atita