[ 
https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712534#comment-16712534
 ] 

Adrien Grand commented on LUCENE-8585:
--------------------------------------

bq. unsure if it is best to load the the IndexedDISI and vBPV jump-tables to 
heap or keep them on storage

If the access pattern is sequential, which I assume would be the case in both 
cases, then it's fine to keep them on storage.

bq. I have copied LuceneDocValuesFormat, LuceneDocValuesConsumer, 
LuceneDocValuesProducer and IndexedDISI from lucene70 to lucene80 and made the 
changes there. The test-class TestIndexedDISI is specifically for the 7.0 
version due to package privacy, so I copied that one too. Was that correct? It 
seems a bit clunky to have a copy of a class with only a few methods added, but 
on the other hand it does work well for separation of codec versions.

This is the correct approach indeed. We can also move the 7.0 format to 
lucene/backward-codecs since lucene/core only keeps formats that are used for 
the current codec.

bq. I kept the old org.apache.lucene.codecs.lucene70.Lucene70DocValuesFormat in 
the file. Was that the right way to do it?

If you move the 7.0 format to lucene/backward-codecs, then you'll need to move 
it to 
lucene/backward-codecs/src/resources/META-INF/services/org.apache.lucene.codecs.DocValuesFormat.

> Create jump-tables for DocValues at index-time
> ----------------------------------------------
>
>                 Key: LUCENE-8585
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8585
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: master (8.0)
>            Reporter: Toke Eskildsen
>            Priority: Minor
>              Labels: performance
>         Attachments: make_patch_lucene8585.sh
>
>
> As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid 
> long iterative walks. This is implemented in LUCENE-8374 at search-time 
> (first request for DocValues from a field in a segment), with the benefit of 
> working without changes to existing Lucene 7 indexes and the downside of 
> introducing a startup time penalty and a memory overhead.
> As discussed in LUCENE-8374, the codec should be updated to create these 
> jump-tables at index time. This eliminates the segment-open time & memory 
> penalties, with the potential downside of increasing index-time for DocValues.
> The three elements of LUCENE-8374 should be transferable to index-time 
> without much alteration of the core structures:
>  * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for 
> every 65536 documents, containing the offset of the block in 33 bits and the 
> index (number of set bits) up to the block in 31 bits.
>  It can be build sequentially and should be stored as a simple sequence of 
> consecutive longs for caching of lookups.
>  As it is fairly small, relative to document count, it might be better to 
> simply memory cache it?
>  * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 
> bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. 
> Each {{short}} represents the number of set bits up to right before the 
> corresponding sub-block of 512 docIDs.
>  The \{{shorts}} can be computed sequentially or when the DENSE block is 
> flushed (probably the easiest). They should be stored as a simple sequence of 
> consecutive shorts for caching of lookups, one logically independent sequence 
> for each DENSE block. The logical position would be one sequence at the start 
> of every DENSE block.
>  Whether it is best to read all the 16 {{shorts}} up front when a DENSE block 
> is accessed or whether it is best to only read any individual {{short}} when 
> needed is not clear at this point.
>  * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric 
> values. Each {{long}} holds the offset to the corresponding block of values.
>  The offsets can be computed sequentially and should be stored as a simple 
> sequence of consecutive {{longs}} for caching of lookups.
>  The vBPV-offsets has the largest space overhead og the 3 jump-tables and a 
> lot of the 64 bits in each long are not used for most indexes. They could be 
> represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, 
> with the downsides of a potential lookup-time overhead and the need for doing 
> the compression after all offsets has been determined.
> I have no experience with the codec-parts responsible for creating 
> index-structures. I'm quite willing to take a stab at this, although I 
> probably won't do much about it before January 2019. Should anyone else wish 
> to adopt this JIRA-issue or co-work on it, I'll be happy to share.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to