I think that as far as pure corpus analysis is concerned, LLR, min/max DF and tf-idf are about as good as you will get. Tf-idf is, in fact, an approximation of LLR, so I don't think you even need it (and it is document-centric rather than corpus-centric in any case). You might get some mileage out of looking for terms whose LLR varies widely from one document to another.
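
To make the last suggestion concrete, here is a minimal sketch in Python. It assumes that "LLR in a document" means the log-likelihood ratio of a 2x2 table comparing the term's count inside one document against its count in the rest of the corpus; that reading is an assumption, as are the names `llr` and `llr_spread` and the choice of standard deviation as the measure of variability.

```python
import math
from statistics import pstdev

def x_log_x(x):
    # Convention: 0 * log(0) = 0
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized Shannon entropy: N log N - sum(k log k)
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: term count in the document     k12: other tokens in the document
    k21: term count in the rest         k22: other tokens in the rest
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off
    return max(0.0, 2.0 * (row + col - mat))

def llr_spread(docs, term):
    """Standard deviation of a term's per-document LLR (hypothetical measure).

    docs is a list of token lists, one list per document.
    """
    total_tokens = sum(len(d) for d in docs)
    total_term = sum(d.count(term) for d in docs)
    scores = []
    for d in docs:
        k11 = d.count(term)
        k12 = len(d) - k11
        k21 = total_term - k11
        k22 = (total_tokens - len(d)) - k21
        scores.append(llr(k11, k12, k21, k22))
    return pstdev(scores)
```

Note that LLR is unsigned: a term that is heavily over-represented in a document and one that is heavily under-represented both score high, so you may want to track the direction separately.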
To get a substantial improvement over these measures, I would recommend adding new data to the mix. The new data I would look at first is some sort of user behavior history. Do you have anything like that?

On Tue, Feb 16, 2010 at 10:22 AM, Drew Farris <[email protected]> wrote:

> Yes, I'm using the LLR score. I was wondering if there is anything
> else I should be looking at other than LLR and min/max DF.
>

--
Ted Dunning, CTO
DeepDyve
