On Fri, Nov 20, 2009 at 4:20 PM, Mark Miller <markrmil...@gmail.com> wrote:
> Okay - I guess that somewhat makes sense - you can calculate the
> magnitude of the doc vectors at index time. How is that impossible with
> incremental indexing though? Isn't it just expensive? Seems somewhat
> expensive in the non-incremental case as well - you're just eating it at
> index time rather than query time - though the same could be done for
> incremental? The information is all there in either case.

Ok, I think I see what you were imagining I was doing: you take the current
state of the index as gospel for idf (when the index is already large, this
is a good approximation) and look up these factors at index time. That means
grabbing docFreq(Term) for each term in my document, and yes, I'd imagine
that would be very expensive.

I've done it by pulling a monstrous map (effectively a Map<String, Float> of
the most common 1 million terms, say) outside of Lucene entirely, which gives
term idfs, and housing it in memory so that computing field norms for cosine
is a very fast operation at index time.

Doing it like this is hard from scratch, but is fine incrementally, because
I've basically fixed idf using some previous corpus (and I update the idfMap
every once in a while, in cases where it doesn't change much). This also has
the effect of providing a global notion of idf in a distributed corpus.

  -jake
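The approach above can be sketched roughly as follows (class and method names are hypothetical, not from Lucene): idf is frozen from a prior corpus snapshot and held in an in-memory map, so computing the cosine norm of a field at index time is just a sum over the document's own term frequencies, with no docFreq(Term) lookups against the live index.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of index-time cosine norms with a precomputed idf map,
// as described in the message above. Names are illustrative.
public class CosineNormSketch {
    // Precomputed idf for the most common terms, built offline from a
    // previous corpus and refreshed occasionally.
    private final Map<String, Float> idfMap;
    private final float defaultIdf; // fallback for terms not in the map

    public CosineNormSketch(Map<String, Float> idfMap, float defaultIdf) {
        this.idfMap = idfMap;
        this.defaultIdf = defaultIdf;
    }

    // Euclidean length of the tf-idf vector for one field, given its
    // term frequencies. No index access needed: idf is fixed up front.
    public float cosineNorm(Map<String, Integer> termFreqs) {
        double sumSq = 0.0;
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            float idf = idfMap.getOrDefault(e.getKey(), defaultIdf);
            double w = e.getValue() * idf; // tf * idf weight
            sumSq += w * w;
        }
        return (float) Math.sqrt(sumSq);
    }

    public static void main(String[] args) {
        Map<String, Float> idf = new HashMap<>();
        idf.put("lucene", 3.0f);
        idf.put("index", 1.5f);

        Map<String, Integer> tf = new HashMap<>();
        tf.put("lucene", 2); // weight 6.0
        tf.put("index", 4);  // weight 6.0

        CosineNormSketch norms = new CosineNormSketch(idf, 1.0f);
        System.out.println(norms.cosineNorm(tf)); // sqrt(72)
    }
}
```

Because the map is fixed, this works identically for batch and incremental indexing; the trade-off is that idf is an approximation from the previous corpus rather than the live index, which is usually acceptable when the corpus is large and idf drifts slowly.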