[
https://issues.apache.org/jira/browse/LUCENE-7485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15563393#comment-15563393
]
Ryan Ernst commented on LUCENE-7485:
------------------------------------
Looks good! A couple minor suggestions:
* I would change {{MAX_ARRAY_LENGTH}} to {{(1 << 12) - 1}} (and adjust
comparisons accordingly), so that the buffer array is actually created with
this exact value (otherwise the name is confusing).
* Add explicit tests around the edges of sparse to dense?
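
To make the two suggestions above concrete, here is a minimal sketch, not the code from the patch. It assumes that a block of 65536 doc IDs is stored as an array of shorts while its cardinality is at most {{MAX_ARRAY_LENGTH}}, and as a bit set otherwise. The class name and the {{isSparse}} helper are hypothetical; only the constant and its proposed value come from the comment.

```java
final class SparseDenseThreshold {
    // Value proposed in the comment above: 4095.
    static final int MAX_ARRAY_LENGTH = (1 << 12) - 1;
    // Assumption: one block covers all doc IDs sharing the same upper 16 bits.
    static final int BLOCK_SIZE = 1 << 16;

    /** Hypothetical helper: true if a block with this many docs would use the short[] encoding. */
    static boolean isSparse(int cardinality) {
        if (cardinality < 1 || cardinality > BLOCK_SIZE) {
            throw new IllegalArgumentException("cardinality out of range: " + cardinality);
        }
        return cardinality <= MAX_ARRAY_LENGTH;
    }

    public static void main(String[] args) {
        // Edge cases around the sparse-to-dense switch.
        int[] edges = {MAX_ARRAY_LENGTH - 1, MAX_ARRAY_LENGTH, MAX_ARRAY_LENGTH + 1};
        for (int cardinality : edges) {
            System.out.println(cardinality + " -> " + (isSparse(cardinality) ? "sparse" : "dense"));
        }
    }
}
```

With this value the sparse buffer never needs more than 4095 entries, and 4095 out of 65536 is roughly 6%, which is consistent with the density figure quoted in the description below. Tests at 4094, 4095 and 4096 docs per block would cover the edges.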
> Better storage for `docsWithField` in Lucene70NormsFormat
> ---------------------------------------------------------
>
> Key: LUCENE-7485
> URL: https://issues.apache.org/jira/browse/LUCENE-7485
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Assignee: Adrien Grand
> Priority: Minor
> Attachments: LUCENE-7485.patch
>
>
> Currently {{Lucene70NormsFormat}} uses a bit set to store documents that have
> a norm, and counts one bits using {{Long.bitCount}} in order to know the
> index of the current document in the set of docs that have a norm value.
> I think this is fairly good if a field is moderately sparse (somewhere
> between 5% and 99%), but it still has some issues, such as slow advance by
> large deltas (it still needs to visit all words in order to accumulate the
> number of ones to know the index of a document) and when very few bits are set.
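
For readers unfamiliar with the technique described in the quoted paragraph, the following is a rough sketch of how the index of a document can be derived from a bit set with {{Long.bitCount}}. It is an illustration only: the class and method names are hypothetical, and it recounts from the start on every call, whereas an iterator would presumably accumulate the count as it advances.

```java
final class BitSetRank {
    /**
     * Returns the number of set bits strictly below {@code doc}, which is the
     * index of {@code doc} among the documents that have a value.
     * Assumes bit {@code doc} is set and that {@code bits} covers it.
     */
    static int indexOf(long[] bits, int doc) {
        int wordIndex = doc >>> 6; // 64 docs per long
        int count = 0;
        // Every word before the target word must be visited: this is the
        // linear cost of advancing by a large delta.
        for (int i = 0; i < wordIndex; i++) {
            count += Long.bitCount(bits[i]);
        }
        // Count only the bits below doc inside its own word.
        long mask = (1L << (doc & 63)) - 1;
        return count + Long.bitCount(bits[wordIndex] & mask);
    }

    /** Small helper for building a bit set from doc IDs. */
    static void set(long[] bits, int doc) {
        bits[doc >>> 6] |= 1L << (doc & 63);
    }

    public static void main(String[] args) {
        long[] bits = new long[3];
        set(bits, 3);
        set(bits, 64);
        set(bits, 130);
        System.out.println(indexOf(bits, 130)); // two docs (3 and 64) precede 130
    }
}
```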
> I have been working on a disk-based adaptation of {{RoaringDocIdSet}} that
> would still give the ability to know the index of the current document. It
> seems to be only a bit slower than the current implementation on moderately
> sparse fields. However, it also comes with benefits:
> * it is faster in the sparse case when it uses the sparse encoding that uses
> shorts to store doc IDs (when the density is 6% or less)
> * it has faster advance() by large deltas (still linear, but cheaper by a
> factor of 65536, which should always be fine in practice since doc IDs are
> bounded by about 2B)
> * it uses O(numDocsWithField) storage rather than O(maxDoc); the worst case
> is 6 bytes per document that has the field, which occurs when each range of
> 65k docs contains exactly one document
> * it is faster if some ranges of documents that share the same 16 upper bits
> are full. This is useful, e.g., if there is a single document that misses a
> field in the whole index, or for use cases that store multiple types of
> documents (with different fields) within a single index and use index
> sorting to put documents of the same type together
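
The block-based idea in the list above can be sketched roughly as follows. This is an in-memory illustration under stated assumptions, not the disk-based format from the patch: documents are grouped by their upper 16 bits, every block uses the sparse short[] encoding, and the names {{BlockedDocIndex}}, {{of}} and {{indexOf}} are hypothetical. The dense and full-block encodings are left out.

```java
import java.util.ArrayList;
import java.util.List;

final class BlockedDocIndex {
    private final int[] blockIds;    // upper 16 bits of each non-empty block, ascending
    private final short[][] lowBits; // sorted lower 16 bits of the docs in each block

    private BlockedDocIndex(int[] blockIds, short[][] lowBits) {
        this.blockIds = blockIds;
        this.lowBits = lowBits;
    }

    /** Builds the structure from doc IDs given in strictly increasing order. */
    static BlockedDocIndex of(int... sortedDocs) {
        List<Integer> ids = new ArrayList<>();
        List<short[]> blocks = new ArrayList<>();
        int start = 0;
        while (start < sortedDocs.length) {
            int block = sortedDocs[start] >>> 16;
            int end = start;
            while (end < sortedDocs.length && (sortedDocs[end] >>> 16) == block) {
                end++;
            }
            short[] low = new short[end - start];
            for (int i = start; i < end; i++) {
                low[i - start] = (short) sortedDocs[i]; // keeps the lower 16 bits
            }
            ids.add(block);
            blocks.add(low);
            start = end;
        }
        int[] idArray = ids.stream().mapToInt(Integer::intValue).toArray();
        return new BlockedDocIndex(idArray, blocks.toArray(new short[0][]));
    }

    /** Returns the index of doc among all stored docs, or -1 if it is absent. */
    int indexOf(int doc) {
        int targetBlock = doc >>> 16;
        int index = 0;
        for (int b = 0; b < blockIds.length && blockIds[b] <= targetBlock; b++) {
            if (blockIds[b] < targetBlock) {
                // Skip a whole block using only its cardinality.
                index += lowBits[b].length;
                continue;
            }
            int target = doc & 0xFFFF;
            for (int i = 0; i < lowBits[b].length; i++) {
                int low = lowBits[b][i] & 0xFFFF; // shorts are signed in Java
                if (low == target) {
                    return index + i;
                }
                if (low > target) {
                    break;
                }
            }
            return -1;
        }
        return -1;
    }
}
```

For example, {{BlockedDocIndex.of(5, 70000, 70001, 200000).indexOf(200000)}} skips two blocks by adding their cardinalities before scanning the third. Storage grows with the number of documents that have a value, not with maxDoc, and skipping costs one step per block of 65536 doc IDs.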
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)