Hi Chris,

I hadn't seen your response until now.  Thanks.
Actually, I was recently playing with fooplot, an online plotting tool (one 
of many), to examine the various formulas and get a better handle on what 
they do.

Thanks for the discussion of 'sweetspot'.  I'm thinking this might help 
others going forward who come across sweet spot and wonder what it is all 
about.
 So the sweet spot is the range beyond which things get too "junky".  
Interesting!  Now I'll have to get my head around that idea, not for fields 
like your (product) descriptions, but for actual documents written by users 
(aka legal documents).  There ARE ridiculous examples in legal documents -- 
things like giant long-running class action lawsuits that, when printed, are 
measured in meters.  But maybe the upper tail would not drop off as fast as 
in your product "description" field example, or maybe sweet spot is not 
really a sensible idea for body fields, which run from very small to 
occasionally very large.  It also might be that cover letters and e-mails, 
while short, are not something to heavily discount.  The lower discount 
range can be ignored by setting the min of any sweet spot to 1 (a sketch of 
that configuration follows below).  Then one starts to wonder if there 
really is any level area.
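
To make that concrete, here is a minimal sketch of the configuration I have 
in mind -- this is my reading of the 3.x contrib/misc API, so treat the 
method signature and the numbers as assumptions, not gospel:

  import org.apache.lucene.misc.SweetSpotSimilarity;

  public class LegalDocsSimilarity {
      public static SweetSpotSimilarity build() {
          SweetSpotSimilarity sim = new SweetSpotSimilarity();
          // Plateau from 1 term up to 50,000 terms (both numbers are
          // illustrative placeholders): short docs get no discount, and
          // only the very long outliers are pushed down.  0.5 is the
          // default steepness; discountOverlaps=true ignores tokens at
          // the same position when counting length.
          sim.setLengthNormFactors(1, 50000, 0.5f, true);
          return sim;
      }
  }

The same instance would then have to be handed to both the writer and the 
searcher (more on that below).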

It is hard to put it all together, but I do appreciate the fact that all 
(nearly all?) of the scoring formula is contained in the class Similarity.  
That presents its own interesting problem, though.
When I get that deep in the code, the issue is not simply the shape of the 
equation, but things like how tweaking any parameter affects the overall 
document scores.  For example, consider the comments about "steepness" 
related to the length norm.  They cover (some of) the mathematics of the 
equation, but until one spends some time with that equation and understands 
where all the pieces fit together, I doubt it jumps out at most folks what 
larger or smaller values mean for terms and the resulting document scores.  
(A small numeric sketch follows.)
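
To illustrate, here is a throwaway sketch that just evaluates the 
length-norm formula from the SweetSpotSimilarity javadocs at a few lengths 
(the min/max of 100/50,000 are only illustrative):

  public class LengthNormSketch {
      // the curve from the javadocs:
      //   1 / sqrt( steepness * (|len-min| + |len-max| - (max-min)) + 1 )
      static double lengthNorm(int len, int min, int max, double steepness) {
          return 1.0 / Math.sqrt(
              steepness * (Math.abs(len - min) + Math.abs(len - max)
                  - (max - min)) + 1.0);
      }

      public static void main(String[] args) {
          // Inside the plateau [100, 50000] the norm is exactly 1.0;
          // outside it decays, and a larger steepness makes it decay faster.
          for (double steep : new double[] { 0.25, 0.5, 1.0 }) {
              System.out.printf(
                  "steepness=%.2f  norm(50)=%.3f  norm(60000)=%.4f%n",
                  steep,
                  lengthNorm(50, 100, 50000, steep),
                  lengthNorm(60000, 100, 50000, steep));
          }
      }
  }

Running something like this made "steepness" click much faster for me than 
reading the javadoc prose did.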

One obviously hard-to-tease-out part of the Similarity API is when each part 
is called -- the simplest split being index time vs. search time.  There are 
some clues, but when a coder attempting any such interesting override is 
looking at a method that contains the actual equation, it is hard to put it 
all back together if one has just "spelunked" down all kinds of interesting 
"twisty little passages all the same" past Weight, Scorer and all their 
friends, passing calls to deprecated APIs (3.4), to get to an actual 
formula.  (My current understanding of that wiring is sketched below.)  It 
is also not easy for an API documenter like you because, while there is a 
normal place where each bit of the equation comes into the overall scoring 
formula, there is really no guarantee that some variation of all the related 
classes will call things in the normal manner.  So I understand your 
challenge.  Now, in everyone's defense (and for readers of this discussion), 
some of the best documentation for the bigger picture is the abstract class 
Similarity, even though it contains no formulas.
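
For the record (and for anyone else spelunking), here is my current 
understanding of the 3.x wiring, sketched with hedges -- the class and 
method names are from the 3.4 API as I read it, not a recipe blessed by the 
docs.  The key point is that the same Similarity instance has to reach both 
sides, because the length norm is baked into the index at write time while 
tf/idf/coord are computed at query time:

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Similarity;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.util.Version;

  public class SimilarityWiring {
      // index time: lengthNorm()/computeNorm() results get baked into norms
      static IndexWriter openWriter(Directory dir, Analyzer a, Similarity sim)
              throws java.io.IOException {
          IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_34, a);
          cfg.setSimilarity(sim);
          return new IndexWriter(dir, cfg);
      }

      // search time: tf(), idf(), coord(), queryNorm() are consulted per query
      static IndexSearcher openSearcher(IndexReader reader, Similarity sim) {
          IndexSearcher searcher = new IndexSearcher(reader);
          searcher.setSimilarity(sim);
          return searcher;
      }
  }

If the writer and the searcher disagree on the Similarity, the norms stored 
in the index no longer match what the query-time math expects, which I 
suspect is the single easiest lever to pull wrong.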

If I get this all figured out myself, maybe I'll submit a talk "changing 
document relevancy for newbies" or "What happens if I pull THIS lever?" :-)

The following is one variation of a plot of computeLengthNorm, as rendered 
in fooplot:

http://fooplot.com/index.php?&type0=0&type1=0&type2=0&type3=0&type4=0&y0=&y1=0.1&y2=%281.0%20/%20sqrt%28%280.5*%28abs%28x-100%29%20%2B%20abs%28x%20-%2050000%29%20-%20%2850000-100%29%29%29%2B%201.0%29%29&y3=&y4=&r0=&r1=&r2=&r3=&r4=&px0=&px1=&px2=&px3=&px4=&py0=&py1=&py2=&py3=&py4=&smin0=0&smin1=0&smin2=0&smin3=0&smin4=0&smax0=2pi&smax1=2pi&smax2=2pi&smax3=2pi&smax4=2pi&thetamin0=0&thetamin1=0&thetamin2=0&thetamin3=0&thetamin4=0&thetamax0=2pi&thetamax1=2pi&thetamax2=2pi&thetamax3=2pi&thetamax4=2pi&ipw=1&ixmin=-50&ixmax=150&iymin=-0.5&iymax=1.5&igx=10&igy=0.25&igl=1&igs=1&iax=0&ila=1&xmin=-50&xmax=150&ymin=-0.5&ymax=1.5
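
(For anyone who would rather not decode that URL: the function plotted there 
is the length norm with min=100, max=50000 and the default steepness of 0.5,

  y = 1.0 / sqrt( 0.5 * (abs(x - 100) + abs(x - 50000) - (50000 - 100)) + 1.0 )

which is flat at 1.0 between 100 and 50,000 terms and falls off outside that 
range.)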

It is hard to say where the best place is for graphs and any such helpful 
discussion: online, or in the source tree.

-Paul

> -----Original Message-----
> From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
> Sent: Tuesday, February 28, 2012 3:15 PM
> To: java-user@lucene.apache.org
> Subject: RE: SweetSpotSimilarity
> 
> 
> : A picture -- or more precisely a graph -- would be worth a thousand words.
> 
> fair enough.  I think the reason i never committed one initially was because 
> the formula in the
> javadocs was trivial to plot in gnuplot...
> 
> gnuplot> min=0
> gnuplot> max=2
> gnuplot> base=1.3
> gnuplot> xoffset=10
> gnuplot> set yrange [0:3]
> gnuplot> set xrange [0:20]
> gnuplot> tf(x) = min + (max-min)/2 * \
>            ( (base**(x-xoffset) - base**-(x-xoffset)) / \
>              (base**(x-xoffset) + base**-(x-xoffset)) + 1 )
> gnuplot> plot tf(x)
> 
> i'll try to get some graphs committed and linked to from the javadocs that
> make it more clear how tweaking the settings affects the formula
> 
> : Another problem mentioned in the e-mail thread Chris linked is "people
> : who know the 'sweetspot' of their data.", but I have yet to find a
> : definition of what is meant by "sweetspot", so I couldn't say whether I
> : know my data's sweet spot or not.
> 
> hmmm... sorry, i kind of just always took it as self evident.  i'm not even
> sure how to define it ... the sweetspot is "the sweetspot" ... the range of
> good values such that things not in the sweetspot are atypical and "less
> good"
> 
> To give a practical example: when i was working with product data we found
> that the sweetspot for the length of a product name was between 4 and 10
> terms.  products with fewer than 4 terms in the name field were usually
> junk products (ie: "ram" or "mouse") and products with more than 10 terms
> in the name were usually junk products that had keyword stuffing going on.
> 
> likewise we determined that for fields like the "product description" the
> sweetspot for tf matching was around 1-5 (if i remember correctly) ...
> because no one term appeared in a "well written" product description more
> than 5 times -- any more than that was keyword spamming.
> 
> every catalog of products is going to be different, and every domain is going 
> to be *much* different
> (ie: if you search books, or encyclopedia articles then the sweetspots are 
> going to be much larger)
> 
> : Another question is how the tf_hyper_offset parameter might be
> : considered.  It appears to be the inflexion point of the tanh equation,
> : but what term count might a caller consider centering there ( or
> 
> right ... it's the center of your sweetspot if you use hyperbolicTf; you
> use the value that makes sense for your data.
> 
> : I also note that the JavaDoc says that the default tf_hyper_base ("the
> : base value to be used in the exponential for the hyperbolic function ")
> : value is e. But checking the code the default is actually 1.3 (less than
> : half e).  Should I file a doc bug?
> 
> I'll fix that (if i remember correctly, "e" is the canonical value
> typically used in doing hyperbolics for some reason, but for tf purposes it
> made for a curve that was too steep to be generally useful by default, so
> we changed it as soon as it was committed) ... thanks for pointing out the
> doc mistake.
> 
> 
> -Hoss
> 

