GitHub user tokee commented on a diff in the pull request:
https://github.com/apache/lucene-solr/pull/525#discussion_r244328885
--- Diff: lucene/core/src/java/org/apache/lucene/codecs/lucene80/IndexedDISI.java ---
@@ -0,0 +1,542 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene80;
+
+import java.io.DataInput;
+import java.io.IOException;
+
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.store.IndexInput;
+import org.apache.lucene.store.IndexOutput;
+import org.apache.lucene.store.RandomAccessInput;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BitSetIterator;
+import org.apache.lucene.util.FixedBitSet;
+import org.apache.lucene.util.RoaringDocIdSet;
+
+/**
+ * Disk-based implementation of a {@link DocIdSetIterator} which can return
+ * the index of the current document, i.e. the ordinal of the current document
+ * among the list of documents that this iterator can return. This is useful
+ * to implement sparse doc values by only having to encode values for documents
+ * that actually have a value.
+ * <p>Implementation-wise, this {@link DocIdSetIterator} is inspired of
+ * {@link RoaringDocIdSet roaring bitmaps} and encodes ranges of {@code 65536}
+ * documents independently and picks between 3 encodings depending on the
+ * density of the range:<ul>
+ * <li>{@code ALL} if the range contains 65536 documents exactly,
+ * <li>{@code DENSE} if the range contains 4096 documents or more; in that
+ * case documents are stored in a bit set,
+ * <li>{@code SPARSE} otherwise, and the lower 16 bits of the doc IDs are
+ * stored in a {@link DataInput#readShort() short}.
+ * </ul>
+ * <p>Only ranges that contain at least one value are encoded.
+ * <p>This implementation uses 6 bytes per document in the worst-case, which happens
+ * in the case that all ranges contain exactly one document.
+ *
+ *
+ * To avoid O(n) lookup time complexity, with n being the number of documents, two lookup
+ * tables are used: A lookup table for block blockCache and index, and a rank structure
+ * for DENSE block lookups.
+ *
+ * The lookup table is an array of {@code long}s with an entry for each block. It allows for
+ * direct jumping to the block, as opposed to iteration from the current position and forward
+ * one block at a time.
+ *
+ * Each long entry consists of 2 logical parts:
+ *
+ * The first 31 bits hold the index (number of set bits in the blocks) up to just before the
+ * wanted block. The next 33 bits holds the offset in bytes into the underlying slice.
+ * As there is a maximum of 2^16 blocks, it follows that the maximum size of any block must
+ * not exceed 2^17 bits to avoid overflow. This is currently the case, with the largest
+ * block being DENSE and using 2^16 + 288 bits, and is likely to continue to hold as using
+ * more than double the amount of bits is unlikely to be an efficient representation.
+ * The cache overhead is numDocs/1024 bytes.
--- End diff --
Nice catch. That was me mixing up bits and bytes and arriving at too strict a
requirement, which made the representation needlessly complicated. All block
types are < 2^17 *bits*, but the offset is in *bytes*, so the largest offset is
below 2^16 * 2^17 / 2^3 = 2^30 bytes, which fits in 30 bits rather than 33.
With the offset requirement lowered from 33 to 30 bits, it is much more
natural to represent offset and index as two ints. This will be in the next
commit.
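
To make the arithmetic concrete, here is a minimal sketch of the bound and of a two-ints-per-block table. It is not the code from the PR: the class `JumpTableSketch` and all of its method names are hypothetical, and it assumes the index always fits in a non-negative int because the total document count does.

```java
/** Hypothetical sketch only; names and layout are assumptions, not the PR's code. */
final class JumpTableSketch {
  static final long MAX_BLOCKS = 1L << 16;     // at most 2^16 blocks
  static final long MAX_BLOCK_BITS = 1L << 17; // every block is smaller than 2^17 bits

  /** Exclusive upper bound, in bytes, on the offset of any block. */
  static long offsetBoundBytes() {
    return MAX_BLOCKS * MAX_BLOCK_BITS / Byte.SIZE;
  }

  /** Number of bits needed to represent every value in [0, exclusiveBound). */
  static int bitsFor(long exclusiveBound) {
    return Long.SIZE - Long.numberOfLeadingZeros(exclusiveBound - 1);
  }

  // Two ints per block: entries[2 * block] is the index, entries[2 * block + 1] the offset.
  private final int[] entries;

  JumpTableSketch(int blockCount) {
    entries = new int[blockCount * 2];
  }

  void set(int block, int index, int offset) {
    entries[2 * block] = index;
    entries[2 * block + 1] = offset;
  }

  /** Number of set bits in all blocks before the given block. */
  int index(int block) {
    return entries[2 * block];
  }

  /** Byte offset of the given block in the underlying slice. */
  int offset(int block) {
    return entries[2 * block + 1];
  }

  public static void main(String[] args) {
    long bound = offsetBoundBytes();
    System.out.println("offset bound in bytes: " + bound);
    System.out.println("bits needed for offset: " + bitsFor(bound));
  }
}
```

Since both values fit in an int, no masking or shifting is needed to read either of them, which is the simplification over packing 31 + 33 bits into one long.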