[ https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16717126#comment-16717126 ]
ASF subversion and git services commented on LUCENE-8585: --------------------------------------------------------- Commit 870bb11cc85c961f9ca9fb638143d8189a5abd6a in lucene-solr's branch refs/heads/master from [~toke] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=870bb11 ] Revert "Pre-commit fixes for LUCENE-8374 (JavaDoc + arguments)". LUCENE-8374 was committed without consensus and is expected to be superseded by LUCENE-8585. This reverts commit 6c111611118ceda0837f25a27e5b4549f2693457. > Create jump-tables for DocValues at index-time > ---------------------------------------------- > > Key: LUCENE-8585 > URL: https://issues.apache.org/jira/browse/LUCENE-8585 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs > Affects Versions: master (8.0) > Reporter: Toke Eskildsen > Priority: Minor > Labels: performance > Attachments: LUCENE-8585.patch, LUCENE-8585.patch, > make_patch_lucene8585.sh > > > As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid > long iterative walks. This is implemented in LUCENE-8374 at search-time > (first request for DocValues from a field in a segment), with the benefit of > working without changes to existing Lucene 7 indexes and the downside of > introducing a startup time penalty and a memory overhead. > As discussed in LUCENE-8374, the codec should be updated to create these > jump-tables at index time. This eliminates the segment-open time & memory > penalties, with the potential downside of increasing index-time for DocValues. > The three elements of LUCENE-8374 should be transferable to index-time > without much alteration of the core structures: > * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for > every 65536 documents, containing the offset of the block in 33 bits and the > index (number of set bits) up to the block in 31 bits. > It can be build sequentially and should be stored as a simple sequence of > consecutive longs for caching of lookups. > As it is fairly small, relative to document count, it might be better to > simply memory cache it? > * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 > bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. > Each {{short}} represents the number of set bits up to right before the > corresponding sub-block of 512 docIDs. > The \{{shorts}} can be computed sequentially or when the DENSE block is > flushed (probably the easiest). They should be stored as a simple sequence of > consecutive shorts for caching of lookups, one logically independent sequence > for each DENSE block. The logical position would be one sequence at the start > of every DENSE block. > Whether it is best to read all the 16 {{shorts}} up front when a DENSE block > is accessed or whether it is best to only read any individual {{short}} when > needed is not clear at this point. > * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric > values. Each {{long}} holds the offset to the corresponding block of values. > The offsets can be computed sequentially and should be stored as a simple > sequence of consecutive {{longs}} for caching of lookups. > The vBPV-offsets has the largest space overhead og the 3 jump-tables and a > lot of the 64 bits in each long are not used for most indexes. They could be > represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, > with the downsides of a potential lookup-time overhead and the need for doing > the compression after all offsets has been determined. > I have no experience with the codec-parts responsible for creating > index-structures. I'm quite willing to take a stab at this, although I > probably won't do much about it before January 2019. Should anyone else wish > to adopt this JIRA-issue or co-work on it, I'll be happy to share. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org