Lucene 1.2 - scoring formula needed

2006-09-10 Thread Karl Koch
Hi,

I am looking for a mathematically precise IR scoring formula for Lucene 1.2. 
The description in the book (Lucene in Action, 2005 edition) is rather 
non-mathematical, and I am also not sure whether it applies to Lucene 1.2 
or only to later versions.

Perhaps Eric or Otis can directly comment on this? Is there a published paper 
on the Lucene scoring algorithm that describes the formula in depth?

Best Regards,
Karl

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene 1.2 - scoring formula needed

2006-09-10 Thread Joaquin Delgado
What do you mean by mathematically correct? Is there something incorrect 
in the book?


According to a message posted some time ago at 
http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200307.mbox/[EMAIL PROTECTED] 
, where people first noticed a change in the scoring algorithm, the 
official FAQ (for 1.2) gave the following formula, from Doug himself:


score(q,d) = sum_t( (tf_q * idf_t / norm_q) * (tf_d * idf_t / norm_d_t) * 
boost_t ) * coord_q_d


where

   * score (q,d) : score for document d given query q
   * sum_t : sum for all terms t in q
   * tf_q : the square root of the frequency of t in q
   * tf_d : the square root of the frequency of t in d
   * idf_t : log(numDocs/(docFreq_t+1)) + 1.0
   * numDocs : number of documents in index
   * docFreq_t : number of documents containing t
   * norm_q : sqrt(sum_t((tf_q*idf_t)^2))
   * norm_d_t : square root of number of tokens in d in the same field
 as t
   * boost_t : the user-specified boost for term t
   * coord_q_d : (number of terms in both query and document) / (number
 of terms in query). The coordination factor gives an AND-like boost
 to documents that contain, e.g., all three terms of a three-word
 query over those that contain just two of the words.
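Read as plain arithmetic, the formula above can be sketched directly in Java. The class and method names here (Lucene12Score, termScore) are illustrative only, not part of any Lucene API, and the numbers in main are made up:

```java
// Sketch of the Lucene 1.2 scoring formula quoted from the FAQ above.
// Illustrative only; not Lucene code.
public class Lucene12Score {

    // idf_t = log(numDocs / (docFreq_t + 1)) + 1.0 (natural log)
    static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // tf = square root of the raw frequency, for both query and document
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // One term's contribution; the full score sums this over all query
    // terms and multiplies by coord_q_d.
    static double termScore(int freqInQuery, int freqInDoc,
                            int numDocs, int docFreq,
                            double normQ, double normDt, double boost) {
        double idfT = idf(numDocs, docFreq);
        return (tf(freqInQuery) * idfT / normQ)
             * (tf(freqInDoc) * idfT / normDt)
             * boost;
    }

    public static void main(String[] args) {
        // Hypothetical single-term query over a 1000-document index.
        int numDocs = 1000, docFreq = 9;     // idf = log(100) + 1
        double idfT = idf(numDocs, docFreq);
        double normQ = tf(1) * idfT;         // sqrt((tf_q*idf_t)^2), one-term query
        double normDt = Math.sqrt(50.0);     // field of 50 tokens
        double coord = 1.0;                  // all query terms matched
        double score = termScore(1, 4, numDocs, docFreq, normQ, normDt, 1.0) * coord;
        System.out.println(score);
    }
}
```

For the one-term query the query-side factor cancels to 1, so the score reduces to tf_d * idf_t / norm_d_t, which matches the intuition that a single-term ranking is driven by the document side alone.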

This is different from the current scoring algorithm described at 
http://lucene.apache.org/java/docs/scoring.html#Scoring, which includes 
field boosting, document length normalization, etc.


In any case, these are variations of the TF-IDF weighted vector space 
model's "cosine of the angle" between the document and query vectors (also 
known as cosine similarity or the normalized dot product - see 
http://en.wikipedia.org/wiki/Dot_product). This computation treats 
documents and queries as vectors in an N-dimensional space, where N is the 
number of unique terms (excluding stopwords).
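The normalized dot product mentioned above can be sketched with two toy vectors. CosineSim and the weights below are illustrative, not taken from Lucene:

```java
// Minimal sketch of cosine similarity between TF-IDF vectors:
// cos(q, d) = (q . d) / (|q| * |d|). Illustrative only.
public class CosineSim {

    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    static double norm(double[] v) {
        return Math.sqrt(dot(v, v));
    }

    // For non-negative TF-IDF weights the result lies in [0, 1];
    // 1.0 means the vectors point in the same direction.
    static double cosine(double[] q, double[] d) {
        return dot(q, d) / (norm(q) * norm(d));
    }

    public static void main(String[] args) {
        // Toy 3-term vocabulary; the weights are made-up TF-IDF values.
        double[] query = {1.0, 0.0, 1.0};
        double[] doc   = {2.0, 1.0, 2.0};
        System.out.println(cosine(query, doc)); // close to 1: similar direction
    }
}
```

Note that the length normalization makes the score independent of document size: doubling every weight in doc leaves the cosine unchanged.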


In statistical/probabilistic terms, this can also be given a geometrical 
interpretation as the correlation between samples drawn from two random 
variables Q and D, representing a query and a document (see 
http://en.wikipedia.org/wiki/Correlation), where each data point (a TF-IDF 
weight) is an estimate of how much "information" each term conveys. There 
are more complex probabilistic ranking algorithms that take advantage of 
prior knowledge of relevance (pre-ranked documents, for example) in their 
computation, primarily by exploiting Bayes' theorem.


Both the Vector Space Model and the Probabilistic Model are well studied in 
the Information Retrieval literature. See 
http://www2.sims.berkeley.edu/courses/is202/f00/lectures/Lecture8_202.ppt 
for an overview of ranking and feedback.




-- Joaquin Delgado






After kill -9 index was corrupt

2006-09-10 Thread Chuck Williams
Hi All,

An application of ours under development had a memory leak that caused
it to slow down interminably.  On Linux, the application did not respond to
kill -15 in a reasonable time, so kill -9 was used to forcibly terminate
it.  After this, the segments file contained a reference to a segment
whose index files were not present.  I.e., the index was corrupt and
Lucene could not open it.

A thread dump at the time of the kill -9 shows that Lucene was merging
segments inside IndexWriter.close().  Since segment merging only commits
(updates the segments file) after the newly merged segment(s) are
complete, I expect this is not the actual problem.

Could a kill -9 prevent data from reaching disk for files that were
previously closed?  If so, then Lucene's index can become corrupt after
kill -9.  In this case, it is possible that a prior merge created new
segment index files, updated the segments file, closed everything, the
segments file made it to disk, but the index data files and/or their
directory entries did not.

If this is the case, it seems to me that flush() and
FileDescriptor.sync() are required on each index file prior to close()
to guarantee no corruption.  Additionally, a FileDescriptor.sync() is
probably also required on the index directory to ensure the directory
entries have been persisted.
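The flush-then-sync sequence described above can be sketched as follows. This is a minimal illustration of the FileOutputStream.getFD().sync() pattern, not Lucene's actual I/O path, and as noted, plain Java offers no portable way to sync the directory entry itself:

```java
// Sketch of the durability pattern discussed above: force file contents
// to stable storage before close(), so a kill -9 or power failure cannot
// leave a committed segments file pointing at index files that never
// reached disk. Illustrative only; not Lucene code.
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class SyncedWrite {

    static void writeDurably(File f, byte[] data) throws IOException {
        FileOutputStream out = new FileOutputStream(f);
        try {
            out.write(data);
            out.flush();         // drain user-space buffers
            out.getFD().sync();  // ask the OS to persist the data to disk
        } finally {
            out.close();
        }
        // Note: this syncs the file's contents, not its directory entry;
        // standard Java has no portable directory sync.
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("segment", ".tmp");
        writeDurably(f, "demo".getBytes());
        System.out.println(f.length());
        f.delete();
    }
}
```

The cost of sync() on every close is real (it forces a disk write barrier per file), which is presumably why making it an index configuration option, as suggested below, is worth considering.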

A power failure or other operating system crash could cause this, not
just kill -9.

Does this seem like a possible explanation and fix for what happened? 
Could the same kind of problem happen on Windows?

If this is the issue, then how would people feel about having Lucene do
sync()s (a) always, or (b) as an index configuration option?

I need to fix whatever happened and so would submit a patch to resolve it.

Thanks for advice and suggestions,

Chuck

