Re: SweetSpotSimilarity

Robert Muir Mon, 05 Mar 2012 15:24:58 -0800

On Mon, Mar 5, 2012 at 6:01 PM, Paul Hill <p...@metajure.com> wrote:
>> I would definitely not suggest using SSS for fields like legal brief text or 
>> emails where there is huge
>> variability in the length of the content -- I can't think of any context
>> where a "short" email is
>> definitively better/worse than a "long" email.  More traditional TF/IDF
>> seems like it would make more
>> sense there.
>
> I was coming to a similar conclusion.
>
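
To make the length-normalization point above concrete, here is a minimal
sketch in plain Java (no Lucene dependency) comparing the classic
1/sqrt(numTerms) length norm with the plateau formula given in the
SweetSpotSimilarity javadocs. The class name, the method names, and the
min/max/steepness values are hypothetical, chosen only for illustration;
field boosts and the one-byte norm encoding are deliberately left out.

```java
// Sketch only: re-implements two length-norm formulas in plain Java.
// Names and parameter values are illustrative assumptions, not Lucene API.
final class LengthNormSketch {

    // Classic TF/IDF length norm: shorter fields always get a higher norm.
    static double classicNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Sweet-spot style norm, following the formula in the SSS javadocs:
    // 1/sqrt(steepness * (|x - min| + |x - max| - (max - min)) + 1).
    // Lengths inside [min, max] all get 1.0; lengths outside are penalized.
    static double sweetSpotNorm(int numTerms, int min, int max, double steepness) {
        double distance = Math.abs(numTerms - min) + Math.abs(numTerms - max) - (max - min);
        return 1.0 / Math.sqrt(steepness * distance + 1.0);
    }

    public static void main(String[] args) {
        int min = 100;          // hypothetical lower edge of the "sweet spot"
        int max = 1000;         // hypothetical upper edge
        double steepness = 0.5; // hypothetical falloff outside the plateau
        for (int len : new int[] {50, 100, 500, 1000, 5000}) {
            System.out.printf("len=%5d classic=%.4f sweetSpot=%.4f%n",
                len, classicNorm(len), sweetSpotNorm(len, min, max, steepness));
        }
    }
}
```

With min = max = 1 and steepness = 0.5 the sweet-spot formula reduces to
1/sqrt(numTerms), so the plateau only matters when you can actually name a
"good" length range, which is hard for briefs or emails.
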
>> well ... hopefully the Similarity docs and the docs on Lucene scoring
>> have filled in most of those
>> blanks before you drill down into the specifics of how SSS works.  If not,
>> then any concrete
>> improvements you can suggest would certainly be appreciated...
>>
>> https://builds.apache.org/view/G-L/view/Lucene/job/Lucene-trunk/javadoc/core/index.html
>> https://builds.apache.org/view/G-L/view/Lucene/job/Lucene-trunk/javadoc/core/org/apache/lucene/search/similarities/Similarity.html
>>
>> https://svn.apache.org/viewvc/lucene/dev/trunk/lucene/site/build/site/scoring.html?view=co
>
> Thanks for the links.
> The first thing I notice is that what is listed at the top of Similarity has
> totally changed.  Great stuff about the object interaction. For example, I
> didn't understand how the Weight object fit in until reading that.
> But I see I got what I asked for.  Someone thought describing the object 
> interaction was more important than the scoring formula itself.  I'll chew on it
> (but I'm currently using the 3.4 code).
>
> My only thought is that the new stuff seems to be at the expense of the 
> formulas listed in the old class overview for Similarity.

Hello,

What was Similarity in older releases has moved to
TFIDFSimilarity: it extends Similarity and exposes a vector-space API,
with the same formulas in its javadocs:
https://builds.apache.org/view/G-L/view/Lucene/job/Lucene-trunk/javadoc/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

The difference is that in 4.0, the idea is to support other scoring
models beyond the vector space model: that's why, if you start looking
at other subclasses of Similarity, you will find more options (e.g.
probabilistic models).

This change is described in CHANGES.txt (below). I hope it's not
confusing: if you have ideas to improve the javadocs and present this
stuff better for migrating users, that would be very helpful.

* LUCENE-2392, LUCENE-3299: Decoupled vector space scoring from
Query/Weight/Scorer. If you extended Similarity directly before, you should
extend TFIDFSimilarity instead. Similarity is now a lower-level API to
implement other scoring algorithms. See MIGRATE.txt for more details.

* LUCENE-2959: Added a variety of different relevance ranking systems to Lucene.

- Added Okapi BM25, Language Models, Divergence from Randomness, and
Information-Based Models. The models are pluggable, support all of Lucene's
features (boosts, slops, explanations, etc) and queries (spans, etc).

- All models default to the same index-time norm encoding as
DefaultSimilarity, so you can easily try these out/switch back and
forth/run experiments and comparisons without reindexing. Note: most of
the models do rely upon index statistics that are new in Lucene 4.0, so
for existing 3.x indexes it's a good idea to upgrade your index to the
new format with IndexUpgrader first.

- Added a new subclass SimilarityBase which provides a simplified API
for plugging in new ranking algorithms without dealing with all of the
nuances and implementation details of Lucene.

- For example, to use BM25 for all fields:
searcher.setSimilarity(new BM25Similarity());

If you instead want to apply different similarities (e.g. ones with
different parameter values or different algorithms entirely) to different
fields, implement PerFieldSimilarityWrapper with your per-field logic.
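
Since "different parameter values" ties back to the length question earlier
in this thread, here is a rough, dependency-free sketch of the textbook
Okapi BM25 term score showing what the k1 and b parameters do. It is not
Lucene's implementation (which also deals with norm encoding and boosts);
the class and method names are hypothetical. The values k1 = 1.2 and
b = 0.75 are the usual defaults.

```java
// Sketch only: textbook Okapi BM25 for a single term, in plain Java.
// Names are illustrative assumptions; this is not the Lucene code path.
final class Bm25Sketch {

    // Smoothed inverse document frequency; rarer terms score higher.
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // k1 controls term-frequency saturation; b controls how strongly the
    // document length (relative to the average length) dampens the score.
    static double score(double tf, double docLen, double avgDocLen,
                        long docCount, long docFreq, double k1, double b) {
        double lengthFactor = 1.0 - b + b * (docLen / avgDocLen);
        return idf(docCount, docFreq) * (tf * (k1 + 1.0)) / (tf + k1 * lengthFactor);
    }

    public static void main(String[] args) {
        long docCount = 10_000, docFreq = 50; // hypothetical index statistics
        double avgDocLen = 400;               // hypothetical average length
        for (double b : new double[] {0.0, 0.75}) {
            System.out.printf("b=%.2f short=%.4f long=%.4f%n", b,
                score(3, 100, avgDocLen, docCount, docFreq, 1.2, b),
                score(3, 4000, avgDocLen, docCount, docFreq, 1.2, b));
        }
    }
}
```

With b = 0 the document length drops out of the score entirely, so a field
with wildly varying lengths could use a small b while other fields keep a
larger one.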

--
lucidimagination.com

