On Mon, Mar 5, 2012 at 6:01 PM, Paul Hill <p...@metajure.com> wrote:
>> I would definitely not suggest using SSS for fields like legal brief text or
>> emails where there is huge variability in the length of the content -- i
>> can't think of any context where a "short" email is definitively
>> better/worse than a "long" email. more traditional TF/IDF seems like it
>> would make more sense there.
>
> I was coming to a similar conclusion.
>
>> well ... hopefully the Similarity docs and the docs on Lucene scoring have
>> filled in most of those blanks before you drill down into the specifics of
>> how SSS works. if not, then any concrete improvements you can suggest would
>> certainly be appreciated...
>>
>> https://builds.apache.org/view/G-L/view/Lucene/job/Lucene-trunk/javadoc/core/index.html
>> https://builds.apache.org/view/G-L/view/Lucene/job/Lucene-trunk/javadoc/core/org/apache/lucene/search/similarities/Similarity.html
>>
>> https://svn.apache.org/viewvc/lucene/dev/trunk/lucene/site/build/site/scoring.html?view=co
>
> Thanks for the links.
> The first thing I notice is that what is listed at the top of Similarity is
> totally changed. Great stuff about the object interaction. For example, I
> didn't understand how the Weight object fit in until reading that.
> But I see I got what I asked for. Someone thought describing the object
> interaction was more important than the scoring formula itself. I'll chew on
> it (but I'm currently using the 3.4 code).
>
> My only thought is that the new stuff seems to be at the expense of the
> formulas listed in the old class overview for Similarity.

Hello,

What was Similarity in older releases has moved to TFIDFSimilarity: it
extends Similarity and exposes a vector-space API, with the same formulas
in its javadocs:

https://builds.apache.org/view/G-L/view/Lucene/job/Lucene-trunk/javadoc/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

The difference is that 4.0 aims to support scoring models beyond the vector
space model. That's why, if you look at the other subclasses of Similarity,
you will find more options (e.g. probabilistic models).

This change is described in CHANGES.txt (below). I hope it's not confusing:
if you have ideas for improving the javadocs and presenting this better for
migrating users, that would be very helpful.

* LUCENE-2392, LUCENE-3299: Decoupled vector space scoring from
  Query/Weight/Scorer. If you extended Similarity directly before, you should
  extend TFIDFSimilarity instead. Similarity is now a lower-level API to
  implement other scoring algorithms. See MIGRATE.txt for more details.

* LUCENE-2959: Added a variety of different relevance ranking systems to
  Lucene.

  - Added Okapi BM25, Language Models, Divergence from Randomness, and
    Information-Based Models. The models are pluggable, support all of
    lucene's features (boosts, slops, explanations, etc) and queries
    (spans, etc).

  - All models default to the same index-time norm encoding as
    DefaultSimilarity, so you can easily try these out/switch back and
    forth/run experiments and comparisons without reindexing. Note: most of
    the models do rely upon index statistics that are new in Lucene 4.0, so
    for existing 3.x indexes it's a good idea to upgrade your index to the
    new format with IndexUpgrader first.

  - Added a new subclass SimilarityBase which provides a simplified API for
    plugging in new ranking algorithms without dealing with all of the
    nuances and implementation details of Lucene.

  - For example, to use BM25 for all fields:
      searcher.setSimilarity(new BM25Similarity());
    If you instead want to apply different similarities (e.g. ones with
    different parameter values or different algorithms entirely) to
    different fields, implement PerFieldSimilarityWrapper with your
    per-field logic.
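
Since the question was partly about the formulas and about fields whose length varies a lot, here is a rough sketch in plain Java (no Lucene dependency) that compares how the two families treat field length. It is not Lucene's code: the class and method names are made up for illustration, the TF-IDF part follows the DefaultSimilarity definitions (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), lengthNorm = 1 / sqrt(fieldLength)) with query norm, coord and boosts left out, and the BM25 part is the textbook Okapi form with k1 and b passed in. Real scores will also differ because Lucene stores norms in a lossy index-time encoding.

```java
// Hypothetical illustration only; none of these names exist in Lucene.
final class ScoringSketch {
    private ScoringSketch() {}

    /** Classic vector-space term score: tf * idf^2 * lengthNorm. */
    public static double classicTfIdf(double freq, long docFreq, long numDocs,
                                      double fieldLength) {
        double tf = Math.sqrt(freq);
        double idf = 1.0 + Math.log((double) numDocs / (docFreq + 1));
        double lengthNorm = 1.0 / Math.sqrt(fieldLength);
        return tf * idf * idf * lengthNorm;
    }

    /** Textbook Okapi BM25 term score with saturation (k1) and length (b) knobs. */
    public static double bm25(double freq, long docFreq, long numDocs,
                              double fieldLength, double avgFieldLength,
                              double k1, double b) {
        double idf = Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
        // b = 0 ignores field length entirely; b = 1 normalizes by it fully.
        double lengthFactor = k1 * (1.0 - b + b * fieldLength / avgFieldLength);
        return idf * freq * (k1 + 1.0) / (freq + lengthFactor);
    }

    public static void main(String[] args) {
        long numDocs = 1000;   // assumed collection size
        long docFreq = 50;     // assumed number of docs containing the term
        double avgLen = 500;   // assumed average field length in terms
        for (double len : new double[] {50, 500, 5000}) {
            System.out.printf("len=%.0f classic=%.4f bm25=%.4f%n",
                    len,
                    classicTfIdf(3, docFreq, numDocs, len),
                    bm25(3, docFreq, numDocs, len, avgLen, 1.2, 0.75));
        }
    }
}
```

The point to look for in the output is how quickly each score drops as the field gets longer at the same term frequency, and that in the BM25 form the length effect can be turned down (or off, with b = 0) without touching the rest of the formula.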