date:20210713

Re: How exactly the normalized length of the documents are stored in the index

2021-07-13 Thread Adrien Grand

The BM25 similarity computes the normalized length as the number of tokens, ignoring synonyms (tokens at the same position). Then it encodes this length as an 8-bit integer in the index using this logic: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/SmallFl
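To make the lossy 8-bit encoding concrete, here is a minimal Python sketch of a float-like scheme of the kind described above: small lengths are stored exactly, and larger ones keep only their 4 most significant bits plus a shift. The function names (`long_to_int4`, `encode_length`, `decode_length`) are hypothetical, and this is an illustration modeled on the idea, not a verbatim port of the Lucene source, so treat the exact constants as assumptions.

```python
# Sketch of a lossy "length -> single byte" encoding (assumed scheme, not Lucene's code).

def long_to_int4(i: int) -> int:
    """Encode a non-negative int keeping only its 4 most significant bits."""
    num_bits = i.bit_length()
    if num_bits < 4:
        return i                        # small ("subnormal") values are stored exactly
    shift = num_bits - 4
    encoded = (i >> shift) & 0x07       # drop the leading 1 bit, which is implicit
    encoded |= (shift + 1) << 3         # store shift + 1; 0 is reserved for small values
    return encoded

def int4_to_long(e: int) -> int:
    """Inverse of long_to_int4, up to the precision that was discarded."""
    bits = e & 0x07
    shift = (e >> 3) - 1
    if shift == -1:
        return bits
    return (bits | 0x08) << shift       # restore the implicit leading bit

MAX_INT4 = long_to_int4(2**31 - 1)      # largest code needed for a 32-bit length
NUM_FREE_VALUES = 255 - MAX_INT4        # spare codes, used to store tiny lengths exactly

def encode_length(n: int) -> int:
    """Map a document length to a value in [0, 255]."""
    if n < 0:
        raise ValueError("length must be non-negative")
    if n < NUM_FREE_VALUES:
        return n
    return NUM_FREE_VALUES + long_to_int4(n - NUM_FREE_VALUES)

def decode_length(b: int) -> int:
    """Recover the (rounded-down) length from its one-byte code."""
    if b < NUM_FREE_VALUES:
        return b
    return NUM_FREE_VALUES + int4_to_long(b - NUM_FREE_VALUES)
```

Under these assumptions, short documents round-trip exactly, while longer ones are rounded down to the nearest representable value, so two documents of similar length can end up with the same stored norm.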

How exactly the normalized length of the documents are stored in the index

2021-07-13 Thread Dwaipayan Roy

During indexing, an inverted index is built from the terms of the documents, and statistics such as term frequency and document frequency are stored. If I understand correctly, the exact document length is not stored in the index, in order to reduce its size. Instead, a normalized length is stored for each document. However, for