Github user jpountz commented on a diff in the pull request:
https://github.com/apache/lucene-solr/pull/525#discussion_r241757056
--- Diff: lucene/core/src/java/org/apache/lucene/codecs/lucene80/IndexedDISI.java ---
@@ -0,0 +1,536 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene80;
+
+import java.io.DataInput;
+import java.io.IOException;
+
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BitSetIterator;
+import org.apache.lucene.util.FixedBitSet;
+import org.apache.lucene.util.RoaringDocIdSet;
+
+/**
+ * Disk-based implementation of a {@link DocIdSetIterator} which can return
+ * the index of the current document, i.e. the ordinal of the current document
+ * among the list of documents that this iterator can return. This is useful
+ * to implement sparse doc values by only having to encode values for documents
+ * that actually have a value.
+ * <p>Implementation-wise, this {@link DocIdSetIterator} is inspired by
+ * {@link RoaringDocIdSet roaring bitmaps} and encodes ranges of {@code 65536}
+ * documents independently and picks between 3 encodings depending on the
+ * density of the range:<ul>
+ * <li>{@code ALL} if the range contains 65536 documents exactly,
+ * <li>{@code DENSE} if the range contains 4096 documents or more; in that
+ * case documents are stored in a bit set,
+ * <li>{@code SPARSE} otherwise, and the lower 16 bits of the doc IDs are
+ * stored in a {@link DataInput#readShort() short}.
+ * </ul>
+ * <p>Only ranges that contain at least one value are encoded.
+ * <p>This implementation uses 6 bytes per document in the worst case, which
+ * happens when all ranges contain exactly one document.
+ *
+ * <p>To avoid O(n) lookup time complexity, with n being the number of documents,
+ * two lookup tables are used: a lookup table for block offset and index, and a
+ * rank structure for DENSE block lookups.
+ *
+ * <p>The lookup table is an array of {@code long}s with an entry for each block.
+ * It allows for direct jumping to the block, as opposed to iterating forward from
+ * the current position one block at a time.
+ *
+ * <p>Each long entry consists of 2 logical parts:
+ *
+ * The first 31 bits hold the index (number of set bits in the blocks) up to just
+ * before the wanted block. The next 33 bits hold the offset into the underlying
+ * slice.
+ * As there is a maximum of 2^16 blocks, it follows that the maximum size of any
+ * block must not exceed 2^17 bits to avoid overflow. This is currently the case,
+ * with the largest
--- End diff --
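To check my reading of the entry layout, here is a quick sketch (method and field names are mine, not code from this patch) of how I understand a jump-table entry is packed, with the index in the upper 31 bits and the offset in the lower 33 bits:

```java
// Illustration only -- names are made up, not the patch's actual code.
// index  : number of set bits in all blocks before the wanted block (31 bits)
// offset : start of the block in the underlying slice               (33 bits)
static long packJumpEntry(long index, long offset) {
  assert index >= 0 && index < (1L << 31);
  assert offset >= 0 && offset < (1L << 33);
  return (index << 33) | offset;
}

static long jumpIndex(long entry)  { return entry >>> 33; }
static long jumpOffset(long entry) { return entry & ((1L << 33) - 1); }
```

Assuming that reading is right, the per-block size limit is there to keep the accumulated offset within those 33 bits, which leads to my question: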
do you mean bytes rather than bits? Offsets seem to measure bytes?
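For context, my rough reading of the per-range encoding choice described in the javadoc, with approximate worst-case payload sizes in bytes (again a sketch, not the patch's code), which is why I read the offset as a byte count:

```java
// Rough reading of the javadoc above; names and size estimates are my own.
enum BlockEncoding { ALL, DENSE, SPARSE }

static BlockEncoding pickEncoding(int cardinality) {
  if (cardinality == 65536) {
    return BlockEncoding.ALL;    // whole range is set, no payload needed
  } else if (cardinality >= 4096) {
    return BlockEncoding.DENSE;  // 65536-bit bit set, i.e. 8192 bytes
  } else {
    return BlockEncoding.SPARSE; // one short per doc, at most 4095 * 2 = 8190 bytes
  }
}
```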
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]