Re: getting number of terms in a document/field

Michael McCandless Fri, 06 Feb 2015 07:32:22 -0800

On Fri, Feb 6, 2015 at 8:51 AM, Ahmet Arslan wrote:
> Hi Michael,
>
> Thanks for the explanation. I am working with a TREC dataset. Since it
> is static, I set the size of that array experimentally.
>
> I followed the DefaultSimilarity#lengthNorm method a bit.
>
> If the default similarity is used with no index-time boost,
> I assume that the norm equals 1.0 / Math.sqrt(numTerms).
>
> The first option is to somehow obtain the pre-computed norm value and apply
> the reverse operation to obtain numTerms:
> numTerms = (1/norm)^2. This will be an approximation because norms are stored
> in a byte.
> How do I access that norm value for a given docid and field?


See the AtomicReader.getNormValues method.
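
As a rough illustration of the first option, the sketch below inverts
norm = 1.0 / Math.sqrt(numTerms) in plain Java. It does not call Lucene:
the class and method names are made up for this example, and it assumes
the norm has already been decoded to a floating-point value and that no
index-time boost was applied. The 0.09375 input is only an illustrative
coarsely quantized value, included to show how precision loss in the
stored norm skews the estimate.

```java
public class NormToLength {
    /**
     * Inverts norm = 1 / sqrt(numTerms).
     * Assumes the default length norm and no index-time boost.
     */
    public static long estimateNumTerms(double norm) {
        if (norm <= 0.0) {
            throw new IllegalArgumentException("norm must be positive");
        }
        double inverse = 1.0 / norm;
        return Math.round(inverse * inverse);
    }

    public static void main(String[] args) {
        // Exact norm for a 100-term field: recovers 100.
        System.out.println(estimateNumTerms(1.0 / Math.sqrt(100)));
        // A coarsely rounded norm near 0.1: the estimate drifts to 114.
        System.out.println(estimateNumTerms(0.09375));
    }
}
```

The second call is why this route only gives an approximation: a small
rounding error in the norm is squared when it is inverted.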

> The second option is to store numTerms as a separate field, like any other
> ordinary field.
> Do I need to calculate it myself, or can I access the already computed
> numTerms value during indexing?
>
> I think I will follow the second option.
> Is there a pointer to an example that demonstrates reading and writing a
> DocValues-based field?

You could just write your own Similarity implementation that stores the
field length directly as the norm. The norm is a long, so you don't have
to compress it if you don't want to.

That custom Similarity is passed FieldInvertState which contains the
number of tokens in the current field, so you can just use that
instead of computing it yourself.
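
The idea can be sketched without Lucene on the classpath. In the
plain-Java model below, FieldState is a hypothetical stand-in for
FieldInvertState (only a token count and an overlap count are modeled),
and computeNorm plays the role of the Similarity method that produces the
norm. It returns the raw token count as a long instead of an encoded
1/sqrt(numTerms). The discountOverlaps flag is an assumption of this
sketch. A real implementation would extend Similarity and read the count
from the FieldInvertState it is handed, so check the exact signatures
against your Lucene version.

```java
public class LengthAsNorm {
    /** Hypothetical stand-in for the per-field state gathered while indexing. */
    public static final class FieldState {
        final int length;     // number of tokens seen in the field
        final int numOverlap; // tokens with position increment 0 (e.g. synonyms)

        public FieldState(int length, int numOverlap) {
            this.length = length;
            this.numOverlap = numOverlap;
        }
    }

    /** Stores the token count itself as the norm, with no lossy encoding. */
    public static long computeNorm(FieldState state, boolean discountOverlaps) {
        return discountOverlaps ? state.length - state.numOverlap : state.length;
    }

    public static void main(String[] args) {
        FieldState state = new FieldState(12, 2);
        System.out.println(computeNorm(state, true));  // count without overlaps
        System.out.println(computeNorm(state, false)); // raw token count
    }
}
```

Because the stored value is the count itself, reading the norm back gives
numTerms exactly, with no inversion step.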

Mike McCandless

http://blog.mikemccandless.com
