I think I see what you are after. I'm after the same knowledge. :)
The only things that I can recommend are books: "Modern Information Retrieval" and "Managing Gigabytes".
And online resources like:
http://finance.groups.yahoo.com/group/mg/ (note the weird host name)
http://www.sims.berkeley.edu/~hearst/irbook/

There is a pile of stuff in Citeseer, but those papers never really dig into the details and always require solid previous knowledge of the field. They are no replacement for a class or a textbook. If you find a good forum for IR, please share.

Otis

--- Kelvin Tan <[EMAIL PROTECTED]> wrote:
> Wouldn't it be great if we could form a study group of Lucene folks who want to take the "next step"? I feel uneasy posting non-Lucene-specific questions to dev or user even if it's related to IR.
>
> Feels to me like there could be a couple like us, who didn't do a dissertation in IR, but would like a more in-depth knowledge for practical purposes. Basically, the end result is that we are able to tune or extend Lucene by using the Expert API (classes marked as Expert). Perhaps a possible outcome is a tuning tutorial for advanced users who already know how to use Lucene.
>
> What do you think?
>
> k
>
> On Sat, 5 Feb 2005 22:10:26 -0800 (PST), Otis Gospodnetic wrote:
> > Exactly. Luckily, since then I've learned a bit from lucene-dev discussions and side IR readings, so some of the topics are making more sense now.
> >
> > Otis
> >
> > --- Kelvin Tan <[EMAIL PROTECTED]> wrote:
> >> Hi Otis, I was re-reading this whole theoretical thread about idf, scoring, normalization, etc. from last Oct and couldn't help laughing out loud when I read your post, coz it summed up what I was thinking the whole time. I think it's really great to have people like Chuck and Paul (Elschot) around. I'm learning so much.
> >>
> >> k
> >>
> >> On Thu, 21 Oct 2004 10:05:51 -0700 (PDT), Otis Gospodnetic wrote:
> >>> Hi Chuck,
> >>>
> >>> The relative lack of responses is not because there is no interest, but probably because there are only a few people on lucene-dev who can follow/understand every detail of your proposal. I understand and hear you, but I have a hard time 'visualizing' some of the formulas in your proposal. What you are saying sounds right to me, but I don't have enough theoretical knowledge to go one way or the other.
> >>>
> >>> Otis
> >>>
> >>> --- Chuck Williams <[EMAIL PROTECTED]> wrote:
> >>>> Hi everybody,
> >>>>
> >>>> Although there doesn't seem to be much interest in this, I have one significant improvement to the below and a couple of observations that clarify the situation.
> >>>>
> >>>> To illustrate the problem better normalization is intended to address: in my current application, for BooleanQuery's of two terms I frequently get a top score of 1.0 when only one of the terms is matched. 1.0 should indicate a "perfect match". I'd like to set my UI up to present the results differently depending on how good the different results are (e.g., showing a visual indication of result quality, dropping the really bad results entirely, and segregating the good results from other only vaguely relevant results). To build this kind of "intelligence" into the UI, I certainly need to know whether my top result matched all query terms, or only half of them. I'd like to have the score tell me directly how good the matches are. The current normalization does not achieve this.
> >>>>
> >>>> The intent of a new normalization scheme is to preserve the current scoring in the following sense: the ratio of the nth result's score to the best result's score remains the same.
> >>>> Therefore, the only question is what normalization factor to apply to all scores. This reduces to a very specific question that determines the entire normalization -- what should the score of the best matching result be?
> >>>>
> >>>> The mechanism below has this property, i.e. it keeps the current score ratios, except that I removed one idf term for reasons covered earlier (this better reflects the current empirically best scoring algorithms). However, removing an idf is a completely separate issue. The improved normalization is independent of whether or not that change is made.
> >>>>
> >>>> For the central question of what the top score should be, upon reflection, I don't like the definition below. It defined the top score as (approximately) the percentage of query terms matched by the top scoring result. A better conceptual definition is to use a weighted average based on the boosts, i.e., downward-propagate all boosts to the underlying terms (or phrases). Specifically, the "net boost" of a term is the product of the direct boost of the term and all boosts applied to encompassing clauses. Then the score of the top result becomes the sum of the net boosts of its matching terms divided by the sum of the net boosts of all query terms.
> >>>>
> >>>> This definition is a refinement of the original proposal below, and it could probably be further refined if some impact of the tf, idf and/or lengthNorm was desired in determining the top score.
> >>>> These other factors seem to be harder to normalize for, although I've thought of some simple approaches; e.g., assume the unmatched terms in the top result have values for these three factors that are the average of the values of the matched terms, then apply exactly the same concept of actual score divided by theoretical maximum score. That would eliminate any need to maintain maximum value statistics in the index.
> >>>>
> >>>> As an example of the simple boost-based normalization, for the query ((a^2 b)^3 (c d^2)) the net boosts are:
> >>>>   a --> 6
> >>>>   b --> 3
> >>>>   c --> 1
> >>>>   d --> 2
> >>>> So if a and b matched, but not c and d, in the top scoring result, its score would be 0.75. The normalizer would be 0.75/(current score except for the current normalization). This normalizer would be applied to all current scores (minus normalization) to create the normalized scores.
> >>>>
> >>>> For the simple query (a b), if only one of the terms matched in the top result, then its score would be 0.5, vs. 1.0 or many other possible scores today.
> >>>>
> >>>> In addition to enabling more "intelligent" UIs that communicate the quality of results to end-users, the proposal below also extends the explain() mechanism to fully explain the final normalized score. However, that change is also independent -- it could be done with the current scoring.
> >>>>
> >>>> Am I the only one who would like to see better normalization in Lucene? Does anybody have a better approach?
> >>>>
> >>>> If you've read this far, thanks for indulging me on this. I would love to see this or something with similar properties in Lucene. I really just want to build my app, but as stated below would write and contribute this if there is interest in putting it in, and nobody else wants to write it.
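[Editor's note: Chuck's net-boost propagation can be sketched in code. The class and method names below (NetBoostDemo, collect, topScore) are hypothetical, made up for illustration of the arithmetic he describes; this is not real Lucene code.]

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class NetBoostDemo {
    interface Node {}
    record Term(String name, double boost) implements Node {}
    record Clause(double boost, List<Node> children) implements Node {}

    // Propagate boosts down the query tree: a term's net boost is its own
    // boost times the boosts of all enclosing clauses.
    static void collect(Node n, double inherited, Map<String, Double> out) {
        if (n instanceof Term t) {
            out.put(t.name(), inherited * t.boost());
        } else if (n instanceof Clause c) {
            for (Node child : c.children()) collect(child, inherited * c.boost(), out);
        }
    }

    // Top score = sum of net boosts of matched terms / sum of all net boosts.
    static double topScore(Map<String, Double> netBoosts, Set<String> matched) {
        double num = 0, den = 0;
        for (Map.Entry<String, Double> e : netBoosts.entrySet()) {
            den += e.getValue();
            if (matched.contains(e.getKey())) num += e.getValue();
        }
        return num / den;
    }

    public static void main(String[] args) {
        // The query ((a^2 b)^3 (c d^2)) from the message above.
        Node query = new Clause(1.0, List.of(
            new Clause(3.0, List.of(new Term("a", 2.0), new Term("b", 1.0))),
            new Clause(1.0, List.of(new Term("c", 1.0), new Term("d", 2.0)))));
        Map<String, Double> nb = new LinkedHashMap<>();
        collect(query, 1.0, nb);
        System.out.println(nb);                              // {a=6.0, b=3.0, c=1.0, d=2.0}
        System.out.println(topScore(nb, Set.of("a", "b")));  // 0.75
    }
}
```

This reproduces the worked numbers in the message: net boosts a=6, b=3, c=1, d=2, and a top score of (6+3)/12 = 0.75 when only a and b match.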
> >>>> Please let me know what you think one way or the other.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Chuck
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Chuck Williams
> >>>>> Sent: Monday, October 18, 2004 7:04 PM
> >>>>> To: 'Lucene Developers List'
> >>>>> Subject: RE: idf and explain(), was Re: Search and Scoring
> >>>>>
> >>>>> Doug Cutting wrote:
> >>>>>> If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
> >>>>> and:
> >>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).
> >>>>>
> >>>>> I'm willing to write, test and contribute code to address the normalization issue, i.e. to yield scores in [0, 1] that are meaningful across searches. Unfortunately, this is considerably more involved than just optionally eliminating the current normalization in Hits. Before undertaking this, I'd like to see if there is agreement in principle that this is a good idea, and that my specific proposal below is the right way to go. Also, I'd like to make sure I've correctly inferred the constraints on writing code to be incorporated into Lucene.
> >>>>> After looking at this in more detail, I agree that the cosine normalization is not the way to go, because of both efficiency and effectiveness considerations. A conceptual approach that would be efficient, relatively easy to implement, and seems to have at least reasonable behavior would be to define the top scoring match to have a score roughly equal to the percentage of query terms it matches (its "netCoord"). Scores below the top hit would be reduced based on the ratio of their raw scores to the raw score of the top hit (considering all of the current score factors, except that I'd like to remove one of the idf factors, as discussed earlier).
> >>>>>
> >>>>> For a couple of simple cases:
> >>>>> a) the top match for a single-term query would always have a score of 1.0,
> >>>>> b) the top scoring match for a BooleanQuery using DefaultSimilarity with all non-prohibited TermQuery clauses would have a score of m/n, where the hit matches m of the n terms.
> >>>>>
> >>>>> This isn't optimal, but seems much better than the current situation. Consider two single-term queries, s and t. If s matches more strongly than t in its top hit (e.g., a higher tf in a shorter field), it would be best if the top score of s was greater than the top score of t. But this kind of normalization would require keeping additional statistics that so far as I know are not currently in the index, like the maximum tf for terms and the minimum length for fields.
> >>>>> These could be expensive to update on deletes. Also, normalizing by such factors could yield lower than subjectively reasonable scores in most cases, so it's not clear it would be better.
> >>>>>
> >>>>> The semantics above are at least clean, easy to understand, and support what seems to me the most important motivation to do this: allowing an application to use simple thresholding to segregate likely-to-be-relevant hits from likely-to-be-irrelevant hits.
> >>>>>
> >>>>> More specifically, for a BooleanQuery of TermQuery's the resulting score functions would be:
> >>>>>
> >>>>> BooleanQuery of TermQuery's sbq = (tq1 ... tqn)
> >>>>>
> >>>>> baseScore(sbq, doc) =
> >>>>>   sum(tqi) boost(tqi) * idf(tqi.term) * tf(tqi.term, doc) * lengthNorm(tqi.term.field, doc)
> >>>>>
> >>>>> rawScore(sbq, doc) = coord(sbq, doc) * baseScore
> >>>>>
> >>>>> norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit)
> >>>>>
> >>>>> score(sbq, doc) = rawScore * norm
> >>>>>
> >>>>> rawScores can be computed in the Scorer.score() methods and therefore used to sort the hits (e.g., in the instance method for collect() in the HitCollector in IndexSearcher.search()). If the top scoring hit does not have the highest baseScore, then its score could be less than its coord; this seems desirable. These formulas imply that no result score can be larger than its coord, so if coord is well-defined (always between 0 and 1) then all results will be normalized between 0 and 1.
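[Editor's note: the score functions above can be traced numerically. The sketch below feeds made-up per-term contributions (standing in for boost*idf*tf*lengthNorm) through the proposed baseScore/rawScore/norm pipeline; the class and helper names are invented for illustration and do not correspond to Lucene code.]

```java
public class NormalizeDemo {
    // termScores[d][t] is the contribution boost*idf*tf*lengthNorm of query
    // term t in document d (made-up numbers); coord[d] is the fraction of
    // query terms document d matches.
    public static double[] normalizedScores(double[][] termScores, double[] coord) {
        int nDocs = termScores.length;
        double[] base = new double[nDocs];
        double maxBase = 0;
        for (int d = 0; d < nDocs; d++) {
            for (double t : termScores[d]) base[d] += t;   // baseScore = sum of term contributions
            maxBase = Math.max(maxBase, base[d]);
        }
        double[] score = new double[nDocs];
        for (int d = 0; d < nDocs; d++) {
            double raw = coord[d] * base[d];               // rawScore = coord * baseScore
            score[d] = raw / maxBase;                      // norm = 1 / max baseScore over hits
        }
        return score;
    }

    public static void main(String[] args) {
        // doc0 matches both terms of a two-term query; doc1 matches only one.
        double[][] termScores = { {0.4, 0.6}, {0.8, 0.0} };
        double[] coord = { 1.0, 0.5 };
        double[] s = normalizedScores(termScores, coord);
        System.out.printf("%.2f %.2f%n", s[0], s[1]);      // 1.00 0.40
    }
}
```

Note how the top hit, which here also has the highest baseScore, ends up with a score equal to its coord (1.0), and no score exceeds its coord, matching the property claimed in the message.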
> >>>>>
> >>>>> In general, the netCoord, which takes the place of coord in the simple case above, needs to be defined for any query. Conceptually, this should be the total percentage of query terms matched by the document. It must be recursively computable from the subquery structure and overridable for specific Query types (e.g., to support specialized coords, like one that is always 1.0, as is useful in multi-field searching). Suitable default definitions for TermQuery and BooleanQuery are:
> >>>>>
> >>>>> TermQuery.netCoord = 1.0 if term matches, 0.0 otherwise
> >>>>>
> >>>>> BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord
> >>>>>
> >>>>> This is not quite percentage of terms matched; e.g., consider a BooleanQuery with two clauses, one of which is a BooleanQuery of 3 terms and the other of which is just a term. However, it doesn't seem unreasonable for a BooleanQuery to state that its clauses are equally important, and this is consistent with the current coord behavior. BooleanQuery.netCoord could be overridden for special cases, like the pure disjunction I use in my app for field expansions:
> >>>>>
> >>>>> MaxDisjunctionQuery(c1 .. cn).netCoord = max(ci) ci.netCoord
> >>>>>
> >>>>> Implementing this would proceed along these lines:
> >>>>> 1. For backwards compatibility, add some kind of newScoring boolean setting.
> >>>>> 2. Update all of these places to behave as indicated if newScoring:
> >>>>>   a. Change Query.weight() to not do any normalization (no call to sumOfSquaredWeights(), Similarity.queryNorm() or normalize()).
> >>>>>   b.
> >>>>> Update all Query.weight classes to set their value according to the terms in the score formula above that don't involve the document (e.g., boost*idf for TermQuery).
> >>>>>   c. Add suitable netCoord definitions to all Scorer classes.
> >>>>>   d. Update all Scorer.score() methods to compute the rawScore as above.
> >>>>>   e. Add the maximum baseScore as a field kept on TopDocs, computed in the HitCollectors.
> >>>>>   f. Change the normalization in Hits to always divide every raw score by the maximum baseScore.
> >>>>>   g. Update all of the current explain() methods to be consistent with this scoring, and to either report the rawScore, or to report the final score if the normalization factor is provided.
> >>>>>   h. Add Hits.explain() (or better, extend Searcher so that it keeps the Hits for use in Searcher.explain()) to call the new explain variation with the normalization factor so that final scores are fully explained.
> >>>>>
> >>>>> If this seems like a good idea, please let me know. I'm sure there are details I've missed that would come out during coding and testing. Also, the value of this is dependent on how reasonable the final scores look, which is hard to tell for sure until it is working.
> >>>>>
> >>>>> The coding standards for Lucene seem reasonably clear from the source code I've read. I could use just simple Java so there shouldn't be any significant JVM dependencies. The above should be fully backward compatible due to the newScoring flag.
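[Editor's note: the recursive netCoord definitions given earlier (TermQuery = 1.0 on a match; BooleanQuery = sum of coord(1, n) * child netCoords; MaxDisjunctionQuery = max of child netCoords) can be sketched as follows. This is a toy model with invented class names, not Lucene's actual Query/Scorer API.]

```java
import java.util.List;
import java.util.Set;

public class NetCoordDemo {
    interface Query { double netCoord(Set<String> matchedTerms); }

    record TermQuery(String term) implements Query {
        public double netCoord(Set<String> matched) {
            return matched.contains(term) ? 1.0 : 0.0;
        }
    }

    // Each of the n clauses contributes coord(1, n) = 1/n of the total.
    record BooleanQuery(List<Query> clauses) implements Query {
        public double netCoord(Set<String> matched) {
            double sum = 0;
            for (Query c : clauses) sum += c.netCoord(matched) / clauses.size();
            return sum;
        }
    }

    // Pure disjunction: take the best clause instead of averaging.
    record MaxDisjunctionQuery(List<Query> clauses) implements Query {
        public double netCoord(Set<String> matched) {
            double max = 0;
            for (Query c : clauses) max = Math.max(max, c.netCoord(matched));
            return max;
        }
    }

    public static void main(String[] args) {
        // Chuck's caveat case: a BooleanQuery of a 3-term BooleanQuery plus one term.
        Query q = new BooleanQuery(List.of(
            new BooleanQuery(List.of(new TermQuery("a"), new TermQuery("b"), new TermQuery("c"))),
            new TermQuery("d")));
        System.out.println(q.netCoord(Set.of("a", "b", "c")));  // 0.5 (3 of 4 terms matched)
        System.out.println(q.netCoord(Set.of("d")));            // 0.5 (1 of 4 terms matched)
    }
}
```

The output illustrates the "not quite percentage of terms matched" caveat: matching the three inner terms and matching only d both yield netCoord 0.5, because the BooleanQuery weights its two clauses equally.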
> >>>>>
> >>>>> On another note, I had to remove the German analyzer in my current 1.4.2 source configuration because GermanStemmer failed to compile due to what are apparently Unicode character constants that I've now got as illegal two-character character constants. This is presumably an encoding problem somewhere that I could track down. It's not important, but if the answer is obvious to any of you, I'd appreciate the quick tip.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Chuck
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Doug Cutting [mailto:[EMAIL PROTECTED]
> >>>>>> Sent: Monday, October 18, 2004 9:44 AM
> >>>>>> To: Lucene Developers List
> >>>>>> Subject: Re: idf and explain(), was Re: Search and Scoring
> >>>>>>
> >>>>>> Chuck Williams wrote:
> >>>>>>> That's a good point on how the standard vector space inner product similarity measure does imply that the idf is squared relative to the document tf. Even having been aware of this formula for a long time, this particular implication never occurred to me. Do you know if anybody has done precision/recall or other relevancy empirical measurements comparing this vs. a model that does not square idf?
> >>>>>>
> >>>>>> No, not that I know of.
> >>>>>>
> >>>>>>> Regarding normalization, the normalization in Hits does not have very nice properties.
> >>>>>>> Due to the > 1.0 threshold check, it loses information, and it arbitrarily defines the highest scoring result in any list that generates scores above 1.0 as a perfect match. It would be nice if score values were meaningful independent of searches, e.g., if "0.6" meant the same quality of retrieval independent of what search was done. This would allow, for example, sites to use a simple quality threshold to only show results that were "good enough". At my last company (I was President and head of engineering for InQuira), we found this to be important to many customers.
> >>>>>>
> >>>>>> If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
> >>>>>>
> >>>>>>> The standard vector space similarity measure includes normalization by the product of the norms of the vectors, i.e.:
> >>>>>>>
> >>>>>>> score(d,q) = sum over t of ( weight(t,q) * weight(t,d) ) /
> >>>>>>>   sqrt [ (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) ]
> >>>>>>>
> >>>>>>> This makes the score a cosine, which, since the values are all positive, forces it to be in [0, 1]. The sumOfSquares() normalization in Lucene does not fully implement this. Is there a specific reason for that?
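[Editor's note: the cosine formula Chuck quotes can be written out as a small function. This is a generic vector-space sketch with an invented class name, unrelated to Lucene's implementation; the weight vectors are assumed to be aligned over the same term positions.]

```java
public class CosineDemo {
    // score(d, q) = sum_t w(t,q)*w(t,d) / sqrt( sum_t w(t,q)^2 * sum_t w(t,d)^2 )
    public static double cosine(double[] q, double[] d) {
        double dot = 0, qq = 0, dd = 0;
        for (int t = 0; t < q.length; t++) {
            dot += q[t] * d[t];
            qq += q[t] * q[t];
            dd += d[t] * d[t];
        }
        return dot / Math.sqrt(qq * dd);
    }

    public static void main(String[] args) {
        // All-positive weights keep the cosine in [0, 1].
        System.out.println(cosine(new double[]{1, 1}, new double[]{1, 1}));  // 1.0
        System.out.println(cosine(new double[]{1, 1}, new double[]{1, 0}));  // ~0.707
    }
}
```

A document matching only one of two equally weighted query terms scores about 0.707 rather than 0.5, one reason cosine normalization behaves differently from the coord-based proposal above.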
> >>>>>>
> >>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).
> >>>>>>
> >>>>>>> Re. explain(), I don't see a downside to extending it to show the final normalization in Hits. It could still show the raw score just prior to that normalization.
> >>>>>>
> >>>>>> In order to normalize scores to 1.0 one must know the maximum score. Explain only computes the score for a single document, and the maximum score is not known.
> >>>>>>
> >>>>>>> Although I think it would be best to have a normalization that would render scores comparable across searches.
> >>>>>>
> >>>>>> Then please submit a patch. Lucene doesn't change on its own.
> >>>>>>
> >>>>>> Doug
> >>>>>>
> >>>>>> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>>>>> For additional commands, e-mail: [EMAIL PROTECTED]