Hi,

We're working on a product where the idea is to present users with the
related documents for a particular time range.

For an overview, think of this as an application that picks up the top
trending blogpost "topics", which are extracted and ingested from various
social sites.
Further, when you look into a topic from the trending list, it shows the
related topics that occur in the same blogposts.
To be marked as related, two topics must have occurred in the same
blogpost; in addition, the more such co-occurrences there are, the higher
the relatedness factor.
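
To make the relatedness definition concrete, here is a minimal Python
sketch of the counting, independent of any index. The function name
related_topics and the sample posts are made up for illustration; this is
not our actual implementation.

```python
from collections import Counter


def related_topics(posts, topic):
    """Rank other topics by how many posts they share with `topic`.

    `posts` is an iterable of topic lists, one list per blogpost
    (a hypothetical in-memory stand-in for the indexed documents).
    """
    counts = Counter()
    for topics in posts:
        unique = set(topics)  # a topic counts once per blogpost
        if topic in unique:
            counts.update(unique - {topic})
    return counts.most_common()


# Illustrative data only.
posts = [
    ["football", "worldcup", "fifa"],
    ["football", "worldcup"],
    ["football", "transfer"],
    ["cricket", "worldcup"],
]
ranked = related_topics(posts, "football")
# "worldcup" shares two posts with "football", so it ranks first.
```

The post containing "cricket" and "worldcup" is ignored because it does not
mention "football" at all.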

The complexity is that the related topics change with the user-defined
date range: if x and y were the most related topics in the blogposts made
in the last 30 days, it is quite possible that x would be more related to z
if the user had asked for related topics over the last 60 days.
So the number of days is user defined, and it affects the related topics.
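
The effect of the date range can be shown with a small Python sketch. The
function name related_in_window, the dates and the posts are all
hypothetical, chosen only so that the 30-day and 60-day answers differ.

```python
from collections import Counter
from datetime import date, timedelta


def related_in_window(posts, topic, today, days):
    """Rank co-occurring topics using only posts from the last `days` days.

    `posts` is an iterable of (published_date, topic_list) pairs.
    """
    cutoff = today - timedelta(days=days)
    counts = Counter()
    for published, topics in posts:
        unique = set(topics)
        if published >= cutoff and topic in unique:
            counts.update(unique - {topic})
    return counts.most_common()


# Illustrative data only: y co-occurs with x recently, z mostly earlier.
today = date(2024, 3, 31)
posts = [
    (date(2024, 3, 20), ["x", "y"]),
    (date(2024, 3, 10), ["x", "y"]),
    (date(2024, 3, 5), ["x", "z"]),
    (date(2024, 2, 20), ["x", "z"]),
    (date(2024, 2, 10), ["x", "z"]),
]
last_30 = related_in_window(posts, "x", today, 30)  # y leads: 2 posts vs 1
last_60 = related_in_window(posts, "x", today, 60)  # z leads: 3 posts vs 2
```

Because the ranking depends on the window, it cannot simply be precomputed
once per topic; it has to be derived from the posts that fall inside
whatever range the user picks.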

For now, every blogpost goes into the index as a separate document, and
topic extraction happens alongside indexing: it extracts the topics from
the blogposts and stores them in a different collection.
Because of this we have a lot of duplicates in the index too; e.g. a
topicname search for "football" returns around 80K documents, all of them
with topicname="football".

I wonder if someone can help me with:
1. How to structure the documents so that the queries are more
performant.
2. How we can detect the RELATED topics.

Any help on this would be highly appreciated.

Thanks in advance.

Atita
