People!!! thanks for sharing, but this is what the wiki is for!! Will everyone kindly add their posts to http://wiki.apache.org/jakarta-lucene/InformationRetrieval for posterity?
k

On Mon, 7 Feb 2005 12:48:02 -0500, Joaquin Delgado wrote:
> A very solid (and free) online course on "intelligent information
> retrieval" with a focus on practical issues can be found from Prof.
> Mooney (Univ. of Texas): http://www.cs.utexas.edu/users/mooney/ir-course/
>
> I've also copied two interesting papers (from my own private
> library -- if you are interested I've got much more to offer) for
> your reading:
>
> For those looking for an answer to the unsolved mystery in IR:
> http://www.triplehop.com/pdf/How_Many_Relevances_in_IR.pdf
>
> For those interested in extending IR systems with statistical NLP
> and machine learning:
> http://www.triplehop.com/pdf/Text_Representation_and_ML_General_Concepts.pdf
>
> BTW, if someone has had a look at http://www.find.com (meta-search,
> indexing and concept clustering system) I'd be interested in their
> opinion.
>
> -- Joaquin
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
> Sent: Monday, February 07, 2005 12:15 PM
> To: Lucene Developers List
> Subject: Re: Study Group (WAS Re: Normalized Scoring)
>
> I think I see what you are after. I'm after the same knowledge. :)
>
> The only things that I can recommend are books:
> Modern Information Retrieval
> Managing Gigabytes
>
> And online resources like:
> http://finance.groups.yahoo.com/group/mg/ (note the weird host name)
> http://www.sims.berkeley.edu/~hearst/irbook/
>
> There is a pile of stuff in CiteSeer, but those papers never really
> dig into the details and always require solid previous knowledge of
> the field. They are no replacement for a class or a textbook.
>
> If you find a good forum for IR, please share.
>
> Otis
>
> --- Kelvin Tan <[EMAIL PROTECTED]> wrote:
>> Wouldn't it be great if we could form a study group of Lucene folks
>> who want to take the "next step"? I feel uneasy posting non-Lucene-
>> specific questions to dev or user even if it's related to IR.
>>
>> Feels to me like there could be a couple like us, who didn't do a
>> dissertation in IR, but would like more in-depth knowledge for
>> practical purposes. Basically, the end result is that we are able
>> to tune or extend Lucene by using the expert API (classes marked
>> as Expert). Perhaps a possible outcome is a tuning tutorial for
>> advanced users who already know how to use Lucene.
>>
>> What do you think?
>>
>> k
>>
>> On Sat, 5 Feb 2005 22:10:26 -0800 (PST), Otis Gospodnetic wrote:
>>> Exactly. Luckily, since then I've learned a bit from lucene-dev
>>> discussions and side IR readings, so some of the topics are
>>> making more sense now.
>>>
>>> Otis
>>>
>>> --- Kelvin Tan <[EMAIL PROTECTED]> wrote:
>>>> Hi Otis, I was re-reading this whole theoretical thread about
>>>> idf, scoring, normalization, etc. from last Oct and couldn't
>>>> help laughing out loud when I read your post, coz it summed
>>>> up what I was thinking the whole time. I think it's really
>>>> great to have people like Chuck and Paul (Elschot) around. I'm
>>>> learning so much.
>>>>
>>>> k
>>>>
>>>> On Thu, 21 Oct 2004 10:05:51 -0700 (PDT), Otis Gospodnetic wrote:
>>>>> Hi Chuck,
>>>>>
>>>>> The relative lack of responses is not because there is no
>>>>> interest, but probably because there are only a few people
>>>>> on lucene-dev who can follow/understand every detail of
>>>>> your proposal. I understand and hear you, but I have a
>>>>> hard time 'visualizing' some of the formulas in your
>>>>> proposal. What you are saying sounds right to me, but I
>>>>> don't have enough theoretical knowledge to go one way or
>>>>> the other.
>>>>>
>>>>> Otis
>>>>>
>>>>> --- Chuck Williams <[EMAIL PROTECTED]> wrote:
>>>>>> Hi everybody,
>>>>>>
>>>>>> Although there doesn't seem to be much interest in this, I
>>>>>> have one significant improvement to the below and a couple
>>>>>> of observations that clarify the situation.
>>>>>>
>>>>>> To illustrate the problem better normalization is intended to
>>>>>> address: in my current application, for BooleanQuerys of two
>>>>>> terms I frequently get a top score of 1.0 when only one of the
>>>>>> terms is matched. 1.0 should indicate a "perfect match". I'd
>>>>>> like to set my UI up to present the results differently
>>>>>> depending on how good the different results are (e.g., showing
>>>>>> a visual indication of result quality, dropping the really bad
>>>>>> results entirely, and segregating the good results from other
>>>>>> only vaguely relevant results). To build this kind of
>>>>>> "intelligence" into the UI, I certainly need to know whether my
>>>>>> top result matched all query terms, or only half of them. I'd
>>>>>> like to have the score tell me directly how good the matches
>>>>>> are. The current normalization does not achieve this.
>>>>>>
>>>>>> The intent of a new normalization scheme is to preserve the
>>>>>> current scoring in the following sense: the ratio of the nth
>>>>>> result's score to the best result's score remains the same.
>>>>>> Therefore, the only question is what normalization factor to
>>>>>> apply to all scores. This reduces to a very specific question
>>>>>> that determines the entire normalization -- what should the
>>>>>> score of the best matching result be?
>>>>>>
>>>>>> The mechanism below has this property, i.e. it keeps the
>>>>>> current score ratios, except that I removed one idf term for
>>>>>> reasons covered earlier (this better reflects the current
>>>>>> empirically best scoring algorithms). However, removing an idf
>>>>>> is a completely separate issue. The improved normalization is
>>>>>> independent of whether or not that change is made.
>>>>>>
>>>>>> For the central question of what the top score should be, upon
>>>>>> reflection, I don't like the definition below. It defined the
>>>>>> top score as (approximately) the percentage of query terms
>>>>>> matched by the top-scoring result. A better conceptual
>>>>>> definition is to use a weighted average based on the boosts,
>>>>>> i.e., propagate all boosts downward to the underlying terms
>>>>>> (or phrases). Specifically, the "net boost" of a term is the
>>>>>> product of the direct boost of the term and all boosts applied
>>>>>> to encompassing clauses. Then the score of the top result
>>>>>> becomes the sum of the net boosts of its matching terms divided
>>>>>> by the sum of the net boosts of all query terms.
>>>>>>
>>>>>> This definition is a refinement of the original proposal below,
>>>>>> and it could probably be further refined if some impact of the
>>>>>> tf, idf and/or lengthNorm was desired in determining the top
>>>>>> score. These other factors seem to be harder to normalize for,
>>>>>> although I've thought of some simple approaches; e.g., assume
>>>>>> the unmatched terms in the top result have values for these
>>>>>> three factors that are the average of the values of the matched
>>>>>> terms, then apply exactly the same concept of actual score
>>>>>> divided by theoretical maximum score. That would eliminate any
>>>>>> need to maintain maximum value statistics in the index.
>>>>>>
>>>>>> As an example of the simple boost-based normalization, for the
>>>>>> query ((a^2 b)^3 (c d^2)) the net boosts are:
>>>>>>   a --> 6
>>>>>>   b --> 3
>>>>>>   c --> 1
>>>>>>   d --> 2
>>>>>>
>>>>>> So if a and b matched, but not c and d, in the top-scoring
>>>>>> result, its score would be 0.75. The normalizer would be
>>>>>> 0.75/(current score except for the current normalization).
>>>>>> This normalizer would be applied to all current scores (minus
>>>>>> normalization) to create the normalized scores.
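[Editor's note: the net-boost computation above is mechanical enough to sketch. The following Python illustration (not Lucene code; the nested-tuple query representation is made up for this example) reproduces the ((a^2 b)^3 (c d^2)) case.]

```python
# A query is either a ("term", name, boost) leaf or a
# ("bool", boost, clauses) node.

def net_boosts(query, inherited=1.0, out=None):
    """Propagate boosts downward: a term's net boost is its own boost
    times the boosts of all encompassing clauses."""
    if out is None:
        out = {}
    if query[0] == "term":
        _, name, boost = query
        out[name] = out.get(name, 0.0) + inherited * boost
    else:
        _, boost, clauses = query
        for clause in clauses:
            net_boosts(clause, inherited * boost, out)
    return out

def top_score(query, matched_terms):
    """Proposed top-result score: sum of net boosts of matched terms
    divided by sum of net boosts of all query terms."""
    boosts = net_boosts(query)
    matched = sum(b for t, b in boosts.items() if t in matched_terms)
    return matched / sum(boosts.values())

# The example from the mail: ((a^2 b)^3 (c d^2))
query = ("bool", 1.0, [
    ("bool", 3.0, [("term", "a", 2.0), ("term", "b", 1.0)]),
    ("bool", 1.0, [("term", "c", 1.0), ("term", "d", 2.0)]),
])
print(net_boosts(query))             # {'a': 6.0, 'b': 3.0, 'c': 1.0, 'd': 2.0}
print(top_score(query, {"a", "b"}))  # 0.75
```

A full match ({"a", "b", "c", "d"}) scores 1.0, so scores stay in [0, 1] by construction.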
>>>>>>
>>>>>> For the simple query (a b), if only one of the terms matched in
>>>>>> the top result, then its score would be 0.5, vs. 1.0 or many
>>>>>> other possible scores today.
>>>>>>
>>>>>> In addition to enabling more "intelligent" UIs that communicate
>>>>>> the quality of results to end users, the proposal below also
>>>>>> extends the explain() mechanism to fully explain the final
>>>>>> normalized score. However, that change is also independent --
>>>>>> it could be done with the current scoring.
>>>>>>
>>>>>> Am I the only one who would like to see better normalization in
>>>>>> Lucene? Does anybody have a better approach?
>>>>>>
>>>>>> If you've read this far, thanks for indulging me on this. I
>>>>>> would love to see this or something with similar properties in
>>>>>> Lucene. I really just want to build my app, but as stated below
>>>>>> I would write and contribute this if there is interest in
>>>>>> putting it in, and nobody else wants to write it. Please let me
>>>>>> know what you think one way or the other.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Chuck
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Chuck Williams
>>>>>>> Sent: Monday, October 18, 2004 7:04 PM
>>>>>>> To: 'Lucene Developers List'
>>>>>>> Subject: RE: idf and explain(), was Re: Search and Scoring
>>>>>>>
>>>>>>> Doug Cutting wrote:
>>>>>>>> If this is a big issue for you, as it seems it is, please
>>>>>>>> submit a patch to optionally disable score normalization in
>>>>>>>> Hits.java.
>>>>>>>
>>>>>>> and:
>>>>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for
>>>>>>>> each document each time a document is added to the
>>>>>>>> collection, since 'weight(t,d)' is dependent on global term
>>>>>>>> statistics. This is prohibitively expensive.
>>>>>>>> Research has also demonstrated that such cosine normalization
>>>>>>>> gives somewhat inferior results (e.g., Singhal's pivoted
>>>>>>>> length normalization).
>>>>>>>
>>>>>>> I'm willing to write, test and contribute code to address the
>>>>>>> normalization issue, i.e. to yield scores in [0, 1] that are
>>>>>>> meaningful across searches. Unfortunately, this is
>>>>>>> considerably more involved than just optionally eliminating
>>>>>>> the current normalization in Hits. Before undertaking this,
>>>>>>> I'd like to see if there is agreement in principle that this
>>>>>>> is a good idea, and that my specific proposal below is the
>>>>>>> right way to go. Also, I'd like to make sure I've correctly
>>>>>>> inferred the constraints on writing code to be incorporated
>>>>>>> into Lucene.
>>>>>>>
>>>>>>> After looking at this in more detail I agree that the cosine
>>>>>>> normalization is not the way to go, because of both efficiency
>>>>>>> and effectiveness considerations. A conceptual approach that
>>>>>>> would be efficient, relatively easy to implement, and seems to
>>>>>>> have at least reasonable behavior would be to define the
>>>>>>> top-scoring match to have a score roughly equal to the
>>>>>>> percentage of query terms it matches (its "netCoord"). Scores
>>>>>>> below the top hit would be reduced based on the ratio of their
>>>>>>> raw scores to the raw score of the top hit (considering all of
>>>>>>> the current score factors, except that I'd like to remove one
>>>>>>> of the idf factors, as discussed earlier).
>>>>>>>
>>>>>>> For a couple of simple cases:
>>>>>>> a) the top match for a single-term query would always have a
>>>>>>>    score of 1.0,
>>>>>>> b) the top-scoring match for a BooleanQuery using
>>>>>>>    DefaultSimilarity with all non-prohibited TermQuery clauses
>>>>>>>    would have a score of m/n, where the hit matches m of the n
>>>>>>>    terms.
>>>>>>>
>>>>>>> This isn't optimal, but seems much better than the current
>>>>>>> situation. Consider two single-term queries, s and t. If s
>>>>>>> matches more strongly than t in its top hit (e.g., a higher tf
>>>>>>> in a shorter field), it would be best if the top score of s
>>>>>>> was greater than the top score of t. But this kind of
>>>>>>> normalization would require keeping additional statistics that
>>>>>>> so far as I know are not currently in the index, like the
>>>>>>> maximum tf for terms and the minimum length for fields. These
>>>>>>> could be expensive to update on deletes. Also, normalizing by
>>>>>>> such factors could yield lower than subjectively reasonable
>>>>>>> scores in most cases, so it's not clear it would be better.
>>>>>>>
>>>>>>> The semantics above are at least clean, easy to understand,
>>>>>>> and support what seems to me the most important motivation to
>>>>>>> do this: allowing an application to use simple thresholding to
>>>>>>> segregate likely-to-be-relevant hits from
>>>>>>> likely-to-be-irrelevant hits.
>>>>>>>
>>>>>>> More specifically, for a BooleanQuery of TermQuerys the
>>>>>>> resulting score functions would be:
>>>>>>>
>>>>>>> BooleanQuery of TermQuerys sbq = (tq1 ... tqn)
>>>>>>>
>>>>>>> baseScore(sbq, doc) =
>>>>>>>   sum(tqi) boost(tqi) * idf(tqi.term) * tf(tqi.term, doc) *
>>>>>>>            lengthNorm(tqi.term.field, doc)
>>>>>>>
>>>>>>> rawScore(sbq, doc) = coord(sbq, doc) * baseScore(sbq, doc)
>>>>>>>
>>>>>>> norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit)
>>>>>>>
>>>>>>> score(sbq, doc) = rawScore(sbq, doc) * norm(sbq, hits)
>>>>>>>
>>>>>>> rawScores can be computed in the Scorer.score() methods and
>>>>>>> therefore used to sort the hits (e.g., in the instance method
>>>>>>> for collect() in the HitCollector in IndexSearcher.search()).
>>>>>>> If the top-scoring hit does not have the highest baseScore,
>>>>>>> then its score could be less than its coord; this seems
>>>>>>> desirable. These formulas imply that no result score can be
>>>>>>> larger than its coord, so if coord is well-defined (always
>>>>>>> between 0 and 1) then all results will be normalized between
>>>>>>> 0 and 1.
>>>>>>>
>>>>>>> In general, the netCoord, which takes the place of coord in
>>>>>>> the simple case above, needs to be defined for any query.
>>>>>>> Conceptually, this should be the total percentage of query
>>>>>>> terms matched by the document. It must be recursively
>>>>>>> computable from the subquery structure and overridable for
>>>>>>> specific Query types (e.g., to support specialized coords,
>>>>>>> like one that is always 1.0, as is useful in multi-field
>>>>>>> searching). Suitable default definitions for TermQuery and
>>>>>>> BooleanQuery are:
>>>>>>>
>>>>>>> TermQuery.netCoord = 1.0 if term matches, 0.0 otherwise
>>>>>>>
>>>>>>> BooleanQuery(c1 ... cn).netCoord =
>>>>>>>   sum(ci) coord(1, n) * ci.netCoord
>>>>>>>
>>>>>>> This is not quite percentage of terms matched; e.g., consider
>>>>>>> a BooleanQuery with two clauses, one of which is a
>>>>>>> BooleanQuery of 3 terms and the other of which is just a term.
>>>>>>> However, it doesn't seem unreasonable for a BooleanQuery to
>>>>>>> state that its clauses are equally important, and this is
>>>>>>> consistent with the current coord behavior.
>>>>>>> BooleanQuery.netCoord could be overridden for special cases,
>>>>>>> like the pure disjunction I use in my app for field
>>>>>>> expansions:
>>>>>>>
>>>>>>> MaxDisjunctionQuery(c1 ... cn).netCoord = max(ci) ci.netCoord
>>>>>>>
>>>>>>> Implementing this would proceed along these lines:
>>>>>>> 1. For backwards compatibility, add some kind of newScoring
>>>>>>>    boolean setting.
>>>>>>> 2. Update all of these places to behave as indicated if
>>>>>>>    newScoring:
>>>>>>>    a. Change Query.weight() to not do any normalization (no
>>>>>>>       call to sumOfSquaredWeights(), Similarity.queryNorm() or
>>>>>>>       normalize()).
>>>>>>>    b. Update all Query.weight classes to set their value
>>>>>>>       according to the terms in the score formula above that
>>>>>>>       don't involve the document (e.g., boost*idf for
>>>>>>>       TermQuery).
>>>>>>>    c. Add suitable netCoord definitions to all Scorer classes.
>>>>>>>    d. Update all Scorer.score() methods to compute the
>>>>>>>       rawScore as above.
>>>>>>>    e. Add the maximum baseScore as a field kept on TopDocs,
>>>>>>>       computed in the HitCollectors.
>>>>>>>    f. Change the normalization in Hits to always divide every
>>>>>>>       raw score by the maximum baseScore.
>>>>>>>    g. Update all of the current explain() methods to be
>>>>>>>       consistent with this scoring, and to either report the
>>>>>>>       rawScore, or to report the final score if the
>>>>>>>       normalization factor is provided.
>>>>>>>    h. Add Hits.explain() (or better, extend Searcher so that
>>>>>>>       it keeps the Hits for use in Searcher.explain()) to call
>>>>>>>       the new explain variation with the normalization factor
>>>>>>>       so that final scores are fully explained.
>>>>>>>
>>>>>>> If this seems like a good idea, please let me know. I'm sure
>>>>>>> there are details I've missed that would come out during
>>>>>>> coding and testing. Also, the value of this depends on how
>>>>>>> reasonable the final scores look, which is hard to tell for
>>>>>>> sure until it is working.
>>>>>>>
>>>>>>> The coding standards for Lucene seem reasonably clear from the
>>>>>>> source code I've read. I could use just simple Java, so there
>>>>>>> shouldn't be any significant JVM dependencies. The above
>>>>>>> should be fully backward compatible due to the newScoring
>>>>>>> flag.
>>>>>>>
>>>>>>> On another note, I had to remove the German analyzer in my
>>>>>>> current 1.4.2 source configuration because GermanStemmer
>>>>>>> failed to compile due to what are apparently Unicode character
>>>>>>> constants that I've now got as illegal two-character character
>>>>>>> constants. This is presumably an encoding problem somewhere
>>>>>>> that I could track down. It's not important, but if the answer
>>>>>>> is obvious to any of you, I'd appreciate the quick tip.
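[Editor's note: the rawScore/norm pipeline proposed above can be sketched end to end. This Python illustration is not Lucene code; the per-document term weights are invented stand-ins for the boost*idf*tf*lengthNorm products, and coord here is the simple matched/total ratio for a flat BooleanQuery of terms.]

```python
def base_score(query_terms, doc_weights):
    """Sum of the per-term contributions (stand-ins for
    boost * idf * tf * lengthNorm) for the terms the doc matches."""
    return sum(doc_weights.get(t, 0.0) for t in query_terms)

def coord(query_terms, doc_weights):
    """Fraction of query terms the document matches (m/n)."""
    matched = sum(1 for t in query_terms if t in doc_weights)
    return matched / len(query_terms)

def normalized_scores(query_terms, docs):
    """rawScore = coord * baseScore; every raw score is divided by the
    maximum baseScore in the result set, so no score exceeds its coord."""
    max_base = max(base_score(query_terms, w) for w in docs.values())
    return {
        doc: coord(query_terms, w) * base_score(query_terms, w) / max_base
        for doc, w in docs.items()
    }

query = ["a", "b"]
docs = {
    "d1": {"a": 0.8, "b": 0.4},  # matches both terms
    "d2": {"a": 1.2},            # matches only 'a', but strongly
}
print(normalized_scores(query, docs))  # d1 -> 1.0, d2 -> 0.5
```

Note the behavior the proposal calls desirable: d1 and d2 have the same baseScore (1.2), so the full match d1 gets its coord of 1.0 while the half match d2 is capped at 0.5, rather than both competing for 1.0 as under the current Hits normalization.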
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Chuck
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Doug Cutting [mailto:[EMAIL PROTECTED]]
>>>>>>>> Sent: Monday, October 18, 2004 9:44 AM
>>>>>>>> To: Lucene Developers List
>>>>>>>> Subject: Re: idf and explain(), was Re: Search and Scoring
>>>>>>>>
>>>>>>>> Chuck Williams wrote:
>>>>>>>>> That's a good point on how the standard vector space inner
>>>>>>>>> product similarity measure does imply that the idf is
>>>>>>>>> squared relative to the document tf. Even having been aware
>>>>>>>>> of this formula for a long time, this particular implication
>>>>>>>>> never occurred to me. Do you know if anybody has done
>>>>>>>>> precision/recall or other empirical relevancy measurements
>>>>>>>>> comparing this vs. a model that does not square idf?
>>>>>>>>
>>>>>>>> No, not that I know of.
>>>>>>>>
>>>>>>>>> Regarding normalization, the normalization in Hits does not
>>>>>>>>> have very nice properties. Due to the > 1.0 threshold check,
>>>>>>>>> it loses information, and it arbitrarily defines the highest
>>>>>>>>> scoring result in any list that generates scores above 1.0
>>>>>>>>> as a perfect match. It would be nice if score values were
>>>>>>>>> meaningful independent of searches, e.g., if "0.6" meant the
>>>>>>>>> same quality of retrieval independent of what search was
>>>>>>>>> done. This would allow, for example, sites to use a simple
>>>>>>>>> quality threshold to only show results that were "good
>>>>>>>>> enough". At my last company (I was President and head of
>>>>>>>>> engineering for InQuira), we found this to be important to
>>>>>>>>> many customers.
>>>>>>>>
>>>>>>>> If this is a big issue for you, as it seems it is, please
>>>>>>>> submit a patch to optionally disable score normalization in
>>>>>>>> Hits.java.
>>>>>>>>
>>>>>>>>> The standard vector space similarity measure includes
>>>>>>>>> normalization by the product of the norms of the vectors,
>>>>>>>>> i.e.:
>>>>>>>>>
>>>>>>>>> score(d,q) = sum over t of ( weight(t,q) * weight(t,d) ) /
>>>>>>>>>   sqrt[ (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) ]
>>>>>>>>>
>>>>>>>>> This makes the score a cosine, which, since the values are
>>>>>>>>> all positive, forces it to be in [0, 1]. The sumOfSquares()
>>>>>>>>> normalization in Lucene does not fully implement this. Is
>>>>>>>>> there a specific reason for that?
>>>>>>>>
>>>>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for
>>>>>>>> each document each time a document is added to the
>>>>>>>> collection, since 'weight(t,d)' is dependent on global term
>>>>>>>> statistics. This is prohibitively expensive. Research has
>>>>>>>> also demonstrated that such cosine normalization gives
>>>>>>>> somewhat inferior results (e.g., Singhal's pivoted length
>>>>>>>> normalization).
>>>>>>>>
>>>>>>>>> Re. explain(), I don't see a downside to extending it to
>>>>>>>>> show the final normalization in Hits. It could still show
>>>>>>>>> the raw score just prior to that normalization.
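[Editor's note: the cosine measure quoted above is easy to state concretely. A minimal Python sketch (not Lucene's implementation; the term weights are invented) shows why it is bounded by [0, 1] for non-negative weights, and why Doug objects: the document norm depends on weight(t,d), which shifts with global term statistics as the collection grows.]

```python
import math

def cosine_score(q_weights, d_weights):
    """q_weights / d_weights: term -> weight(t,q) / weight(t,d).
    Returns dot(q, d) / (|q| * |d|), the cosine of the two vectors."""
    terms = set(q_weights) | set(d_weights)
    dot = sum(q_weights.get(t, 0.0) * d_weights.get(t, 0.0) for t in terms)
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in d_weights.values()))
    return dot / (q_norm * d_norm)

q = {"a": 1.0, "b": 1.0}
# A document whose weights are proportional to the query's scores 1.0;
# partial overlap scores strictly lower.
print(cosine_score(q, {"a": 2.0, "b": 2.0}))  # 1.0
print(cosine_score(q, {"a": 2.0}))            # ~0.707
```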
>>>>>>>>
>>>>>>>> In order to normalize scores to 1.0 one must know the maximum
>>>>>>>> score. Explain only computes the score for a single document,
>>>>>>>> and the maximum score is not known.
>>>>>>>>
>>>>>>>>> Although I think it would be best to have a normalization
>>>>>>>>> that would render scores comparable across searches.
>>>>>>>>
>>>>>>>> Then please submit a patch. Lucene doesn't change on its own.
>>>>>>>>
>>>>>>>> Doug
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]