[ 
https://issues.apache.org/jira/browse/LUCENE-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712878#comment-16712878
 ] 

Toke Eskildsen commented on LUCENE-8585:
----------------------------------------

Thank you for the clarifications, [~jpountz].

Regarding where to put the jump-data:
{quote}If the access pattern is sequential, which I assume would be the case in 
both cases, then it's fine to keep them on storage.
{quote}
Well, that really depends on the access pattern from the outside ;). But since the 
jump-entries are stored sequentially, a request that hits a smaller subset of 
the documents in a way that benefits from jumps will access the jump-entries in 
increasing order. The jump-entries are not used when a jump stays within the 
current block or goes to the block immediately following the current one.
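
To make the rule above concrete, here is a minimal sketch of the decision, assuming blocks of 65536 docIDs as in the issue description. The class and method names are hypothetical and this is not the code from the patch:

```java
// Hypothetical sketch: decides whether an advance needs a jump-table entry.
// Assumes 65536 docIDs per block and docIDs that are >= 0 (or -1 before the first advance).
final class JumpDecisionSketch {
  static final int BLOCK_SHIFT = 16; // 2^16 = 65536 docIDs per block

  /**
   * Returns the index of the jump-table entry to read for the target docID,
   * or -1 when the target is in the current block or the one right after it,
   * in which case plain sequential reading is enough.
   */
  static int jumpEntryFor(int currentDoc, int targetDoc) {
    // Arithmetic shift keeps currentDoc == -1 mapped to block -1.
    int currentBlock = currentDoc >> BLOCK_SHIFT;
    int targetBlock = targetDoc >> BLOCK_SHIFT;
    return targetBlock > currentBlock + 1 ? targetBlock : -1;
  }
}
```

Since the target docIDs of a single iterator only increase, the entry indexes returned by such a check also only increase, which is why the reads against the stored jump-entries are forward-only.
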
{quote}We can also move the 7.0 format to lucene/backward-codecs since 
lucene/core only keeps formats that are used for the current codec.
{quote}
Before I began, there was a single file {{Lucene80Codec.java}} in the 
{{lucene80}} package, picking codec-parts from 50, 60 and 70. After 
implementing the jumps, I have not touched the {{Lucene70Norms*}}-part. I 
_guess_ I should move the {{Lucene70DocValues*}}-files from {{lucene70}} to 
{{backward-codecs}}, leaving the norms-classes?

Since the norms-classes also use {{IndexedDISI}}, I expect it would be best to 
upgrade them too. This would leave the core {{lucene70}} folder empty of active 
code.
{quote}If you move the 7.0 format to lucene/backward-codecs, then you'll need 
to move it to 
lucene/backward-codecs/src/resources/META-INF/services/org.apache.lucene.codecs.DocValuesFormat.
{quote}
That makes sense, thanks!

> Create jump-tables for DocValues at index-time
> ----------------------------------------------
>
>                 Key: LUCENE-8585
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8585
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: master (8.0)
>            Reporter: Toke Eskildsen
>            Priority: Minor
>              Labels: performance
>         Attachments: LUCENE-8585.patch, make_patch_lucene8585.sh
>
>
> As noted in LUCENE-7589, lookup of DocValues should use jump-tables to avoid 
> long iterative walks. This is implemented in LUCENE-8374 at search-time 
> (first request for DocValues from a field in a segment), with the benefit of 
> working without changes to existing Lucene 7 indexes and the downside of 
> introducing a startup time penalty and a memory overhead.
> As discussed in LUCENE-8374, the codec should be updated to create these 
> jump-tables at index time. This eliminates the segment-open time & memory 
> penalties, with the potential downside of increasing index-time for DocValues.
> The three elements of LUCENE-8374 should be transferable to index-time 
> without much alteration of the core structures:
>  * {{IndexedDISI}} block offset and index skips: A {{long}} (64 bits) for 
> every 65536 documents, containing the offset of the block in 33 bits and the 
> index (number of set bits) up to the block in 31 bits.
>  It can be built sequentially and should be stored as a simple sequence of 
> consecutive longs for caching of lookups.
>  As it is fairly small, relative to document count, it might be better to 
> simply memory cache it?
>  * {{IndexedDISI}} DENSE (> 4095, < 65536 set bits) blocks: A {{short}} (16 
> bits) for every 8 {{longs}} (512 bits) for a total of 256 bytes/DENSE_block. 
> Each {{short}} represents the number of set bits up to right before the 
> corresponding sub-block of 512 docIDs.
>  The {{shorts}} can be computed sequentially or when the DENSE block is 
> flushed (probably the easiest). They should be stored as a simple sequence of 
> consecutive shorts for caching of lookups, one logically independent sequence 
> for each DENSE block. The logical position would be one sequence at the start 
> of every DENSE block.
>  Whether it is best to read all the 128 {{shorts}} up front when a DENSE block 
> is accessed or whether it is best to only read any individual {{short}} when 
> needed is not clear at this point.
>  * Variable Bits Per Value: A {{long}} (64 bits) for every 16384 numeric 
> values. Each {{long}} holds the offset to the corresponding block of values.
>  The offsets can be computed sequentially and should be stored as a simple 
> sequence of consecutive {{longs}} for caching of lookups.
>  The vBPV-offsets have the largest space overhead of the 3 jump-tables and a 
> lot of the 64 bits in each long are not used for most indexes. They could be 
> represented as a simple {{PackedInts}} sequence or {{MonotonicLongValues}}, 
> with the downsides of a potential lookup-time overhead and the need for doing 
> the compression after all offsets have been determined.
> I have no experience with the codec-parts responsible for creating 
> index-structures. I'm quite willing to take a stab at this, although I 
> probably won't do much about it before January 2019. Should anyone else wish 
> to adopt this JIRA-issue or co-work on it, I'll be happy to share.
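
The three jump-table layouts listed in the issue description can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, placing the offset in the upper 33 bits and the index in the lower 31 bits is an assumed ordering (the description gives only the bit widths), and none of this is taken from the attached patch.

```java
// Hypothetical sketch of the three jump-table elements from the issue description.
final class JumpTableSketch {
  static final int INDEX_BITS = 31;                      // index: set bits before the block
  static final long INDEX_MASK = (1L << INDEX_BITS) - 1;
  static final long MAX_OFFSET = (1L << 33) - 1;         // offset must fit in 33 bits

  /** Packs a block offset (33 bits, assumed upper) and an index (31 bits, assumed lower). */
  static long packEntry(long blockOffset, int index) {
    if (blockOffset < 0 || blockOffset > MAX_OFFSET) {
      throw new IllegalArgumentException("offset does not fit in 33 bits: " + blockOffset);
    }
    if (index < 0) {
      throw new IllegalArgumentException("index does not fit in 31 bits: " + index);
    }
    return (blockOffset << INDEX_BITS) | index;
  }

  static long offsetOf(long entry) {
    return entry >>> INDEX_BITS; // unsigned shift: the top bit may be set
  }

  static int indexOf(long entry) {
    return (int) (entry & INDEX_MASK);
  }

  /**
   * Computes the rank entries for one DENSE block: 1024 longs (65536 bits) give
   * 128 shorts, one per sub-block of 8 longs (512 docIDs). Each short holds the
   * number of set bits before its sub-block. Values can exceed Short.MAX_VALUE,
   * so readers must treat them as unsigned, e.g. with (rank & 0xFFFF).
   */
  static short[] denseRanks(long[] blockBits) {
    if (blockBits.length != 1024) {
      throw new IllegalArgumentException("a block holds 1024 longs");
    }
    short[] ranks = new short[128];
    int setBits = 0;
    for (int sub = 0; sub < ranks.length; sub++) {
      ranks[sub] = (short) setBits; // count up to right before this sub-block
      for (int i = 0; i < 8; i++) {
        setBits += Long.bitCount(blockBits[sub * 8 + i]);
      }
    }
    return ranks;
  }

  /** Index into the vBPV offset table: one long per 16384 (2^14) numeric values. */
  static int vbpvBlock(long valueIndex) {
    return (int) (valueIndex >>> 14);
  }
}
```

With this layout a DENSE block carries 128 * 2 = 256 bytes of rank data, matching the figure in the description, and all three tables can be filled in a single forward pass while the blocks are written.
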



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

