Re: How to pull document scoring values
On Wednesday 29 September 2004 15:41, Zia Syed wrote:
> Hi Paul,
> Thanks for your detailed reply! It really helped a lot.
> However, I am experiencing some conflicts.
>
> For one of the documents in the result set, when I use
>
> IndexReader fir = FilterIndexReader.open("index");
> byte[] fNorm = fir.norms("Body");
> System.out.println("FNorm: " + fNorm[306]);
> Document d = fir.document(306);
> Field f = d.getField("Body");
> System.out.println("Body: " + f.stringValue());
>
> this gives me fNorm 113, whereas the total number of terms (including
> stop words) is 42 in this particular field of the selected document.
> In the explanation, fieldNorm(field=Body, doc=306) is 0.1562, which is
> approximately 41 terms for that field in that document. So the
> explanation value makes sense with the real data when all stop words
> (to, it, the, etc.) are included.
>
> So, my question is:
>
> Am I getting the norm values from the right place?

Yes, but the stored norms are encoded/decoded:

byte Similarity.encodeNorm(float)
float Similarity.decodeNorm(byte)

> Is there any way to find out the number of indexed terms for each
> document?

By default, the stored norm is the inverse square root of the number of
indexed terms of an indexed document field. The encoding/decoding is
somewhat rough, though.

Regards,
Paul Elschot

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
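[Editor's note: the rough 8-bit encoding Paul describes can be sketched in plain Java. The encoder below is modeled on the 3-bit-mantissa, 5-bit-exponent "small float" scheme behind Similarity.encodeNorm/decodeNorm; it is an illustration, not the real Lucene classes, and the 35-term field is a hypothetical value chosen to show how a byte of 113 can line up with the reported fieldNorm of 0.1562 even though the raw (stop words included) term count was 42.]

```java
// Sketch of the 8-bit norm encoding: a "small float" with a 3-bit
// mantissa and 5-bit exponent. Modeled on Lucene's encodeNorm/decodeNorm;
// the term counts below are hypothetical.
public class NormEncoding {

    // Encode a float into a single byte (truncating, not rounding).
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // underflow
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1; // overflow
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // Decode the byte back into an approximation of the original float.
    static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        // Hypothetical field with 35 indexed terms (stop words removed):
        float norm = (float) (1.0 / Math.sqrt(35)); // ~0.169
        byte stored = floatToByte315(norm);
        System.out.println("stored byte: " + stored);                  // 113
        System.out.println("decoded:     " + byte315ToFloat(stored));  // 0.15625
        // Under this encoding, any field with 29 to 40 indexed terms
        // stores this same byte, which is why the encoding is "rough".
    }
}
```

Note how the decoded value 0.15625 matches the 0.1562 printed by Explanation, while the stored byte is the 113 seen in the norms array: the two numbers are the same quantity before and after decoding.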
Re: How to pull document scoring values
Hi Paul,

Thanks for your detailed reply! It really helped a lot. However, I am
experiencing some conflicts.

For one of the documents in the result set, when I use

IndexReader fir = FilterIndexReader.open("index");
byte[] fNorm = fir.norms("Body");
System.out.println("FNorm: " + fNorm[306]);
Document d = fir.document(306);
Field f = d.getField("Body");
System.out.println("Body: " + f.stringValue());

this gives me fNorm 113, whereas the total number of terms (including
stop words) is 42 in this particular field of the selected document. In
the explanation, fieldNorm(field=Body, doc=306) is 0.1562, which is
approximately 41 terms for that field in that document. So the
explanation value makes sense with the real data when all stop words
(to, it, the, etc.) are included.

So, my questions are:

> Am I getting the norm values from the right place?
> Is there any way to find out the number of indexed terms for each
> document?

Please advise!

Thanks,
Zia

On Wed, 2004-09-29 at 08:17, Paul Elschot wrote:
> Zia,
>
> On Tuesday 28 September 2004 21:22, you wrote:
> > Hi,
> >
> > I'm trying to learn the scoring mechanism of Lucene. I want to fetch
> > each parameter value individually as they are collectively dumped out
> > by Explanation. I've managed to pull out the TF and IDF values using
> > DefaultSimilarity and FilterIndexReader, but I'm not sure where to
> > get the fieldNorm and queryNorm from.
>
> The norms are here:
> http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#norms(java.lang.String)
> The resulting array is indexed by the document number for the
> IndexReader. With the default similarity, each norm is the inverse
> square root of the number of indexed terms in the document field.
> However, there are only 8 bits available to encode this value, so it's
> quite rough.
>
> The default queryNorm is here:
> http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/DefaultSimilarity.html#queryNorm(float)
> There is an explanation of the scoring in the javadocs of Similarity.
> There has been some discussion of an idf factor that was missing from
> this documentation; I don't know whether the docs have been adapted
> for this.
>
> > Also, is there any reference about how normalisation has been
> > implemented?
>
> See above; DefaultSimilarity is the default implementation of the
> Similarity interface. queryNorm() takes a sumOfSquaredWeights, where
> the weights are the term weights from the query, and returns the
> inverse of its square root.
>
> It may be that the sum of squared weights should be a sum of
> square-rooted weights and that queryNorm should return a square then.
> I posted this on lucene-user on 20 September:
> http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=10023
>
> It's only a normalisation, so it doesn't affect the order of the
> search results much. Taking the square roots of the query term weights
> would have the query weights directly applied to the query term
> density in the document field, whereas now the weights seem to be
> applied to the square root of the density. The density value is an
> approximation; see above for the rough field norms.
>
> Regards,
> Paul Elschot

--
Zia Syed <[EMAIL PROTECTED]>
Smartweb Research Center, Robert Gordon University
Re: How to pull document scoring values
Zia,

On Tuesday 28 September 2004 21:22, you wrote:
> Hi,
>
> I'm trying to learn the scoring mechanism of Lucene. I want to fetch
> each parameter value individually as they are collectively dumped out
> by Explanation. I've managed to pull out the TF and IDF values using
> DefaultSimilarity and FilterIndexReader, but I'm not sure where to get
> the fieldNorm and queryNorm from.

The norms are here:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#norms(java.lang.String)
The resulting array is indexed by the document number for the
IndexReader. With the default similarity, each norm is the inverse
square root of the number of indexed terms in the document field.
However, there are only 8 bits available to encode this value, so it's
quite rough.

The default queryNorm is here:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/DefaultSimilarity.html#queryNorm(float)
There is an explanation of the scoring in the javadocs of Similarity.
There has been some discussion of an idf factor that was missing from
this documentation; I don't know whether the docs have been adapted for
this.

> Also, is there any reference about how normalisation has been
> implemented?

See above; DefaultSimilarity is the default implementation of the
Similarity interface. queryNorm() takes a sumOfSquaredWeights, where
the weights are the term weights from the query, and returns the
inverse of its square root.

It may be that the sum of squared weights should be a sum of
square-rooted weights and that queryNorm should return a square then.
I posted this on lucene-user on 20 September:
http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=10023

It's only a normalisation, so it doesn't affect the order of the search
results much. Taking the square roots of the query term weights would
have the query weights directly applied to the query term density in
the document field, whereas now the weights seem to be applied to the
square root of the density. The density value is an approximation; see
above for the rough field norms.

Regards,
Paul Elschot
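[Editor's note: the queryNorm computation Paul points at can be sketched without Lucene. The shape is queryNorm(sumOfSquaredWeights) = 1 / sqrt(sumOfSquaredWeights); the term weights below are hypothetical idf*boost values, chosen only to show how the sum is formed.]

```java
// Sketch of the default queryNorm described above:
// queryNorm(sumOfSquaredWeights) = 1 / sqrt(sumOfSquaredWeights).
// The weights are hypothetical query term weights (e.g. idf * boost).
public class QueryNormSketch {

    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        float[] termWeights = {1.2f, 0.8f, 2.0f}; // hypothetical
        float sumOfSquares = 0f;
        for (float w : termWeights) {
            sumOfSquares += w * w; // the "sumOfSquaredWeights" argument
        }
        System.out.println("queryNorm: " + queryNorm(sumOfSquares));
    }
}
```

Because every term score in a query is multiplied by the same queryNorm, it rescales scores without changing their order, which is Paul's point that the choice of normalisation "doesn't affect the order of the search results much".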
How to pull document scoring values
Hi,

I'm trying to learn the scoring mechanism of Lucene. I want to fetch
each parameter value individually as they are collectively dumped out
by Explanation. I've managed to pull out the TF and IDF values using
DefaultSimilarity and FilterIndexReader, but I'm not sure where to get
the fieldNorm and queryNorm from.

Also, is there any reference about how normalisation has been
implemented?

Any ideas?

Thanks,
Zia

--
Zia Syed <[EMAIL PROTECTED]>
Smartweb Research Center, Robert Gordon University