[ 
https://issues.apache.org/jira/browse/LUCENE-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13256760#comment-13256760
 ] 

Robert Muir commented on LUCENE-3957:
-------------------------------------

I don't understand why it's long-winded; it's documented in tons of places in
Lucene. In fact, it's actually over-specified in file formats, for example,
because even in 3.5 the encoding of the normalization byte is an
implementation detail of the Similarity: it's just that you can only use a
single byte.

In trunk it's definitely over-specified since, besides the above, the
Similarity can use more than a byte if it wants to.

1. Main website (scoring): 
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/scoring.html
{noformat}
Indexing time boosts are preprocessed for storage efficiency and written to the 
directory (when writing the document) in a single byte (!) as follows.
...
This composition of 1-byte representation of norms...
...
Encoding and decoding of the resulted float norm in a single byte are done by 
the static methods of the class Similarity: encodeNorm() and decodeNorm(). Due 
to loss of precision, it is not guaranteed that decode(encode(x)) = x, e.g. 
decode(encode(0.89)) = 0.75. At scoring (search) time, this norm is brought 
into the score of document as norm(t, d), as shown by the formula in 
Similarity. 
{noformat}

2. Main website (file formats):
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html#Normalization%20Factors
{noformat}
Each byte encodes a floating point value. Bits 0-2 contain the 3-bit mantissa, 
and bits 3-8 contain the 5-bit exponent.

These are converted to an IEEE single float value as follows: 
...
{noformat}

3. Javadocs (Similarity): 
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/core/org/apache/lucene/search/Similarity.html
{noformat}
However the resulted norm value is encoded as a single byte before being 
stored. At search time, the norm byte value is read from the index directory 
and decoded back to a float norm value. This encoding/decoding, while reducing 
index size, comes with the price of precision loss...
 
Compression of norm values to a single byte saves memory at search time, 
because once a field is referenced at search time, its norms - for all 
documents - are maintained in memory.
 
The rationale supporting such lossy compression of norm values is that given 
the difficulty (and inaccuracy) of users to express their true information need 
by a query, only big differences matter. 
{noformat}
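
To make the precision loss described in the excerpts above concrete, here is a minimal, self-contained sketch of a one-byte float encoding with a 3-bit mantissa and 5-bit exponent. It is modeled on my recollection of Lucene's SmallFloat scheme and has not been checked against the 3.5 source; the class name NormSketch, the method names encode/decode, and the zero-exponent constant 15 are assumptions for illustration, not Lucene's API. In this sketch only the top two fraction bits of the IEEE float survive, which is why boosts of 8.0 and 9.0 land on the same byte while 10.0 gets the next one.

```java
// Hypothetical sketch, NOT Lucene's actual class: one-byte float encoding
// with a 3-bit mantissa field and a 5-bit exponent field.
final class NormSketch {
    private static final int MANTISSA_BITS = 3;
    private static final int ZERO_EXP = 15; // assumed exponent offset
    // Offset between the float's exponent range and the byte's range,
    // shifted into the byte's exponent position.
    private static final int FZERO = (63 - ZERO_EXP) << MANTISSA_BITS;

    /** Lossy encode: truncates the float's low 21 bits (no rounding). */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - MANTISSA_BITS);
        if (small <= FZERO) {
            // Zero and negatives map to 0; tiny positives to the smallest non-zero byte.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= FZERO + 0x100) {
            return (byte) -1; // overflow maps to the largest byte
        }
        return (byte) (small - FZERO);
    }

    /** Decode the byte back to the float value it represents. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - MANTISSA_BITS);
        bits += (63 - ZERO_EXP) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        for (float boost : new float[] {8.0f, 9.0f, 10.0f}) {
            byte b = encode(boost);
            System.out.println(boost + " -> byte " + (b & 0xff) + " -> " + decode(b));
        }
    }
}
```

Under these assumptions, encode(8.0f) and encode(9.0f) return the same byte, and both decode to 8.0, while 10.0 survives the round trip unchanged. That matches the behaviour reported in the issue description: between powers of two the representable values advance in steps of a quarter of the lower power, so smaller differences in a boost can be truncated away.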


                
> Document precision requirements of setBoost calls
> -------------------------------------------------
>
>                 Key: LUCENE-3957
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3957
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: general/javadocs
>    Affects Versions: 3.5
>            Reporter: Jordi Salvat i Alabart
>
> The behaviour of index-time boosts seems pretty erratic (e.g. a boost of 8.0 
> produces the exact same score as a boost of 9.0) until you become aware that 
> these factors end up encoded in a single byte, with a three-bit mantissa. 
> This consumed a whole day of research for us, and I still believe we were 
> lucky to spot it, given how deeply dug into the code & documentation this 
> information is.
> I suggest adding a small note to the JavaDoc of setBoost methods in Document, 
> Fieldable, FieldInvertState, and possibly AbstractField, Field, and 
> NumericField.
> Suggested text:
> "Note that all index-time boost values end up encoded using 
> Similarity.encodeNormValue, with a 3-bit mantissa -- so differences in the 
> boost value of less than 25% may easily be rounded away."
