[jira] [Commented] (LUCENE-7485) Better storage for `docsWithField` in Lucene70NormsFormat

Ryan Ernst (JIRA) Mon, 10 Oct 2016 13:27:33 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15563393#comment-15563393
 ]


Ryan Ernst commented on LUCENE-7485:
------------------------------------

Looks good! A couple minor suggestions:
* I would change {{MAX_ARRAY_LENGTH}} to {{(1 << 12) - 1}} (and adjust 
comparisons accordingly), so that the buffer array is actually created with 
this exact value (otherwise the name is confusing).
* Add explicit tests around the boundaries of the sparse-to-dense transition?
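
A minimal sketch of the first suggestion, assuming a hypothetical helper class (only the name {{MAX_ARRAY_LENGTH}} comes from the patch; `SparseDenseThreshold`, `newBuffer` and `isSparse` are made up for illustration). The constant holds the exact buffer size, and the comparison becomes `<=` rather than `<`:

```java
final class SparseDenseThreshold {
    // Largest cardinality that is still encoded sparsely: 4095.
    static final int MAX_ARRAY_LENGTH = (1 << 12) - 1;

    // Hypothetical buffer, allocated with exactly MAX_ARRAY_LENGTH slots
    // so that the constant's name matches what it sizes.
    static short[] newBuffer() {
        return new short[MAX_ARRAY_LENGTH];
    }

    // Assumption: a block stays sparse while its doc IDs fit in the buffer.
    static boolean isSparse(int cardinality) {
        return cardinality <= MAX_ARRAY_LENGTH;
    }
}
```

Under these assumptions, the edge cases worth testing explicitly are blocks holding 4094, 4095 and 4096 documents, since 4095 is the last sparse cardinality and 4096 the first dense one.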

> Better storage for `docsWithField` in Lucene70NormsFormat
> ---------------------------------------------------------
>
>                 Key: LUCENE-7485
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7485
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-7485.patch
>
>
> Currently {{Lucene70NormsFormat}} uses a bit set to store documents that have 
> a norm, and counts one bits using {{Long.bitCount}} in order to know the 
> index of the current document in the set of docs that have a norm value.
> I think this is fairly good if a field is moderately sparse (somewhere 
> between 5% and 99% of docs have a value), but it still has some issues: 
> advancing by large deltas is slow (it still needs to visit all intermediate 
> words in order to accumulate the number of ones to know the index of a 
> document), and it does poorly when very few bits are set.
> I have been working on a disk-based adaptation of {{RoaringDocIdSet}} that 
> would still give the ability to know the index of the current document. It 
> seems to be only a bit slower than the current implementation on moderately 
> sparse fields. However, it also comes with benefits:
>  * it is faster in the sparse case, where it uses the sparse encoding that 
> stores doc IDs as shorts (when the density is 6% or less)
>  * it has faster advance() by large deltas (still linear, but cheaper by a 
> factor of 65536, so that should always be fine in practice since doc IDs are 
> bounded by about 2 billion)
>  * it uses O(numDocsWithField) storage rather than O(maxDoc); the worst case 
> is 6 bytes per document that has a value, which occurs when each range of 
> 65k docs contains exactly one document
>  * it is faster if some ranges of documents that share the same 16 upper bits 
> are full; this is useful e.g. if there is a single document that misses a 
> field in the whole index, or for use cases that store multiple types of 
> documents (with different fields) within a single index and use index 
> sorting to put documents of the same type together


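The sparse encoding mentioned in the quoted issue can be sketched in the same spirit. The names below are hypothetical, and the binary search is an assumption made for the example; the patch may look up doc IDs within a block differently. The idea shown is only that a doc ID splits into 16 upper bits (the block) and 16 lower bits (stored as a short in sparse blocks).

```java
final class RoaringStyleBlock {
    // Upper 16 bits select the block of 65536 consecutive doc IDs.
    static int blockOf(int doc) {
        return doc >>> 16;
    }

    // Lower 16 bits are what a sparse block stores, one short per document.
    static int lowBits(int doc) {
        return doc & 0xFFFF;
    }

    // Index of doc within a sparse block whose values are sorted as
    // unsigned 16-bit integers, or -1 if the block does not contain doc.
    static int indexInSparseBlock(short[] sortedLowBits, int doc) {
        int target = lowBits(doc);
        int lo = 0;
        int hi = sortedLowBits.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int value = sortedLowBits[mid] & 0xFFFF; // read the short as unsigned
            if (value < target) {
                lo = mid + 1;
            } else if (value > target) {
                hi = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }
}
```

Since doc IDs are non-negative ints, there are at most 2^31 / 2^16 = 32768 such blocks, which is why a scan that is linear in the number of blocks stays cheap.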

