Yes, it will be too much to do in real time, but it is a good idea tough. I don't know if a vector of term frequencies is stored with the document. Because I could search on the index to get the subset of documents and then take the term frequencies from there. In that case I could change MoreLikeThis to receive a set of term frequencies, instead of an IndexReader, and use that to do all the process.
Anyone knows if a document contains for his fields the term frequencies? On Wed, Apr 23, 2008 at 7:46 AM, Karl Wettin <[EMAIL PROTECTED]> wrote: > Jonathan Ariel skrev: > > > Smart idea, but it won't help me. I have almost 50 categories and > > eventually > > I would like to "filter" not just on category but maybe also on > > language, > > etc. > > Karl: what do you mean by measure the distance between the term vectors > > and > > cluster them in real time? > > > > I mean exactly what I say, that if your subsets are small enough you could > evalute the cosine coefficient and group documents accordingly. > > 2 million documents is however way to much data to do that in real time. > > I would probably create one index for each "filter" you want to use. > > > karl > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >