Re: How to pull document scoring values

2004-09-29 Thread Paul Elschot
On Wednesday 29 September 2004 15:41, Zia Syed wrote:
> Hi Paul,
> Thanks for your detailed reply! It really helped a lot.
> However, I am seeing some conflicting values.
>
> For one of the documents in the result set, when I use
>
> IndexReader fir = FilterIndexReader.open("index");  // open() is a static method inherited from IndexReader
> byte[] fNorm = fir.norms("Body");  // norms(), not norm(): one encoded byte per document
> System.out.println("FNorm: " + fNorm[306]);  // prints the raw encoded byte
> Document d = fir.document(306);
> Field f = d.getField("Body");
>
> System.out.println("Body: " + f.stringValue());
>
> This prints an fNorm of 113, whereas the total number of terms (including
> stop words) in this particular field of the selected document is 42. In the
> Explanation, fieldNorm(field=Body, doc=306) is 0.1562, which corresponds to
> roughly 41 terms for that field in that document, since 1/sqrt(41) is
> approximately 0.1562. So the Explanation values make sense against the real
> data, counting all stop words like to, it, the, etc.
>
> So, my questions are:
>
> > Am I getting the norm values from the right place?

Yes, but the stored norms are encoded/decoded:
byte Similarity.encodeNorm(float)
float Similarity.decodeNorm(byte)
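
For example, decoding the byte you printed gives back the value shown in
your Explanation. A quick sketch against the 1.4 API, reusing your doc 306
and field "Body":

import org.apache.lucene.search.Similarity;

float norm = Similarity.decodeNorm((byte) 113);  // the stored byte from your output
System.out.println("decoded: " + norm);  // 0.15625, the fieldNorm in your Explanation
System.out.println("approx terms: " + Math.round(1.0 / (norm * norm)));  // about 41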

> > Is there any way to find out the number of indexed terms for each
> > document?

By default, the stored norm is the inverse square root of the number of
indexed terms in an indexed document field. The encoding/decoding is
somewhat rough, though.
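
To see how rough, you can run the encode/decode round trip over a range of
term counts; neighbouring counts often collapse onto the same byte. Another
sketch, again assuming the 1.4 API:

import org.apache.lucene.search.Similarity;

for (int n = 35; n <= 50; n++) {
    float lengthNorm = (float) (1.0 / Math.sqrt(n));  // the default norm for n indexed terms
    byte b = Similarity.encodeNorm(lengthNorm);
    System.out.println(n + " terms -> byte " + b + " -> " + Similarity.decodeNorm(b));
}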

Regards,
Paul Elschot





Re: How to pull document scoring values

2004-09-29 Thread Zia Syed
Hi Paul,
Thanks for your detailed reply! It really helped a lot.
However, I am seeing some conflicting values.

For one of the documents in the result set, when I use

IndexReader fir = FilterIndexReader.open("index");  // open() is a static method inherited from IndexReader
byte[] fNorm = fir.norms("Body");  // norms(), not norm(): one encoded byte per document
System.out.println("FNorm: " + fNorm[306]);  // prints the raw encoded byte
Document d = fir.document(306);
Field f = d.getField("Body");

System.out.println("Body: " + f.stringValue());

This prints an fNorm of 113, whereas the total number of terms (including
stop words) in this particular field of the selected document is 42. In the
Explanation, fieldNorm(field=Body, doc=306) is 0.1562, which corresponds to
roughly 41 terms for that field in that document, since 1/sqrt(41) is
approximately 0.1562. So the Explanation values make sense against the real
data, counting all stop words like to, it, the, etc.

So, my questions are:
> Am I getting the norm values from the right place?
> Is there any way to find out the number of indexed terms for each
> document?

Please advise!

Thanks,

Zia



On Wed, 2004-09-29 at 08:17, Paul Elschot wrote:
> Zia,
> 
> On Tuesday 28 September 2004 21:22, you wrote:
> > Hi,
> >
> > I'm trying to learn the scoring mechanism of Lucene. I want to fetch
> > each parameter value individually, as they are collectively dumped out by
> > Explanation. I've managed to pull out the tf and idf values using
> > DefaultSimilarity and FilterIndexReader, but I'm not sure where to get
> > the fieldNorm and queryNorm.
> 
> The norms are here:
> http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#norms(java.lang.String)
> The resulting array is indexed by the document number for the IndexReader.
> With the default similarity, each norm is the inverse square root of the
> number of indexed terms in the document field. However, there are only
> 8 bits available to encode this value, so it's quite rough.
> 
> The default queryNorm is here:
> http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/DefaultSimilarity.html#queryNorm(float)
> There is an explanation of the scoring in the javadocs of Similarity.
> There has been some discussion of an idf factor that was missing from this
> documentation; I don't know whether the docs have been updated for this.
> 
> > Also, is there any reference on how the normalisation is implemented?
> 
> See above: DefaultSimilarity is the default implementation of the abstract
> Similarity class. queryNorm() takes a sumOfSquaredWeights, where the weights
> are the term weights from the query, and returns the inverse square root,
> 1/sqrt(sumOfSquaredWeights).
> 
> It may be that the sum of squared weights should instead be a sum of
> square-rooted weights, with queryNorm returning a square in that case.
> I posted this on lucene-user on 20 September:
> http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=10023
> 
> It's only a normalisation, so it doesn't affect the order of the search
> results much. Taking the square roots of the query term weights would apply
> the query weights directly to the query term density in the document field,
> whereas now the weights seem to be applied to the square root of the density.
> The density value is an approximation; see above about the rough field norms.
> 
> Regards,
> Paul Elschot
> 
> 
-- 
Zia Syed <[EMAIL PROTECTED]>
Smartweb Research Center, Robert Gordon University





Re: How to pull document scoring values

2004-09-29 Thread Paul Elschot
Zia,

On Tuesday 28 September 2004 21:22, you wrote:
> Hi,
>
> I'm trying to learn the scoring mechanism of Lucene. I want to fetch
> each parameter value individually, as they are collectively dumped out by
> Explanation. I've managed to pull out the tf and idf values using
> DefaultSimilarity and FilterIndexReader, but I'm not sure where to get
> the fieldNorm and queryNorm.

The norms are here:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#norms(java.lang.String)
The resulting array is indexed by the document number for the IndexReader.
With the default similarity, each norm is the inverse square root of the
number of indexed terms in the document field. However, there are only
8 bits available to encode this value, so it's quite rough.
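
In code, fetching and decoding a norm looks something like this; a sketch
against the 1.4 API, with a made-up index path, field name, and document
number:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Similarity;

IndexReader reader = IndexReader.open("index");  // example path
byte[] norms = reader.norms("Body");  // example field; one encoded byte per document
float fieldNorm = Similarity.decodeNorm(norms[306]);  // example doc number; decode the 8-bit norm
System.out.println("fieldNorm: " + fieldNorm);
reader.close();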

The default queryNorm is here:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/DefaultSimilarity.html#queryNorm(float)
There is an explanation of the scoring in the javadocs of Similarity.
There has been some discussion of an idf factor that was missing from this
documentation; I don't know whether the docs have been updated for this.

> Also, is there any reference on how the normalisation is implemented?

See above: DefaultSimilarity is the default implementation of the abstract
Similarity class. queryNorm() takes a sumOfSquaredWeights, where the weights
are the term weights from the query, and returns the inverse square root,
1/sqrt(sumOfSquaredWeights).
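
For instance, with a made-up sumOfSquaredWeights of 2.0, as from a two-term
query with unit weights (sketch, 1.4 API):

import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Similarity;

Similarity sim = new DefaultSimilarity();
float queryNorm = sim.queryNorm(2.0f);  // 1 / sqrt(2.0)
System.out.println("queryNorm: " + queryNorm);  // about 0.7071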

It may be that the sum of squared weights should instead be a sum of
square-rooted weights, with queryNorm returning a square in that case.
I posted this on lucene-user on 20 September:
http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=10023

It's only a normalisation, so it doesn't affect the order of the search
results much. Taking the square roots of the query term weights would apply
the query weights directly to the query term density in the document field,
whereas now the weights seem to be applied to the square root of the density.
The density value is an approximation; see above about the rough field norms.

Regards,
Paul Elschot





How to pull document scoring values

2004-09-28 Thread Zia Syed
Hi,

I'm trying to learn the scoring mechanism of Lucene. I want to fetch
each parameter value individually, as they are collectively dumped out by
Explanation. I've managed to pull out the tf and idf values using
DefaultSimilarity and FilterIndexReader, but I'm not sure where to get
the fieldNorm and queryNorm.
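
For reference, the tf and idf part looked roughly like this; the term
"lucene" in field "Body" is just a made-up example, and this assumes the
1.4 API:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Similarity;

IndexReader reader = IndexReader.open("index");
Similarity sim = new DefaultSimilarity();
Term term = new Term("Body", "lucene");  // made-up example term
TermDocs td = reader.termDocs(term);
while (td.next()) {
    System.out.println("doc " + td.doc() + " tf " + sim.tf(td.freq()));  // sqrt(freq) by default
}
td.close();
System.out.println("idf " + sim.idf(reader.docFreq(term), reader.numDocs()));
reader.close();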
Also, is there any reference on how the normalisation is implemented?

Any idea?

Thanks,
Zia
-- 
Zia Syed <[EMAIL PROTECTED]>
Smartweb Research Center, Robert Gordon University

