[jira] [Commented] (LUCENE-10536) Doc values terms dicts should use the first term of each block as a dictionary

ASF subversion and git services (Jira) Thu, 12 May 2022 01:37:23 -0700


    [ https://issues.apache.org/jira/browse/LUCENE-10536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17535971#comment-17535971 ]


ASF subversion and git services commented on LUCENE-10536:
----------------------------------------------------------

Commit 1677be091ec9f3775d74bd927b8bfd4fea4d383d in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1677be091ec ]

LUCENE-10536: Slightly better compression of doc values' terms dictionaries. 
(#838)

Doc values terms dictionaries keep the first term of each block uncompressed so
that they can somewhat efficiently perform binary searches across blocks.
Suffixes of the other 63 terms are compressed together using LZ4 to leverage
redundancy across suffixes. This change improves compression a bit by using the
first (uncompressed) term of each block as a dictionary when compressing
suffixes of the 63 other terms. This helps with compressing the first few
suffixes when there's not much context yet that can be leveraged to find
duplicates.
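
The effect described above can be illustrated with a small sketch. Python's
standard library does not ship LZ4, so this example swaps in DEFLATE with a
preset dictionary (the `zdict` parameter of `zlib`) as a stand-in. The helper
names and the sample terms are hypothetical, and this is not Lucene's code.

```python
import zlib

def compress(data: bytes, dictionary: bytes = b"") -> bytes:
    """Raw-DEFLATE compress `data`, optionally primed with a preset dictionary."""
    kwargs = {"zdict": dictionary} if dictionary else {}
    comp = zlib.compressobj(level=9, wbits=-15, **kwargs)
    return comp.compress(data) + comp.flush()

def decompress(data: bytes, dictionary: bytes = b"") -> bytes:
    """Inverse of compress(); the same dictionary must be supplied."""
    kwargs = {"zdict": dictionary} if dictionary else {}
    decomp = zlib.decompressobj(wbits=-15, **kwargs)
    return decomp.decompress(data) + decomp.flush()

# Hypothetical block: the first term is stored uncompressed, and the second
# term "user-0002@example.com" shares the prefix "user-000" with it, leaving
# the suffix below.
first_term = b"user-0001@example.com"
second_suffix = b"2@example.com"

plain = compress(second_suffix)                          # no context to match against
primed = compress(second_suffix, dictionary=first_term)  # can reference "@example.com"

print(len(plain), len(primed))
```

Without a dictionary the compressor sees the suffix with nothing before it, so
every byte is emitted as a literal. With the first term as a dictionary, the
trailing `@example.com` can be encoded as a back-reference. Since the first
term is already stored uncompressed in the block, the reader has the
dictionary available at no extra storage cost.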


> Doc values terms dicts should use the first term of each block as a dictionary
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-10536
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10536
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Doc values terms dictionaries split data into blocks of 64 terms. The first 
> term of each block is written uncompressed (which is useful for binary 
> searches), and the 63 other terms are encoded as a difference from the 
> previous term, with all suffixes compressed together using LZ4.
> With this format, the suffix of the second term is unlikely to benefit from 
> any compression, since LZ4 has no data other than the suffix itself in which 
> to search for duplicate bytes. A minor improvement would be to use the first 
> term as a dictionary for the suffixes of terms 2..64.
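
The block layout described in the issue can be sketched roughly as follows.
This is a simplified illustration, not Lucene's on-disk format: the function
names, the `(prefix_length, suffix)` tuples, and the three-term block are
assumptions made for the example (a real block holds 64 terms).

```python
def shared_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def encode_block(terms):
    """Keep the first term whole; encode each later term as
    (length of prefix shared with the previous term, remaining suffix)."""
    first = terms[0]
    deltas = []
    prev = first
    for term in terms[1:]:
        p = shared_prefix_len(prev, term)
        deltas.append((p, term[p:]))
        prev = term
    return first, deltas

def decode_block(first, deltas):
    """Rebuild the terms by reusing the prefix of the previously decoded term."""
    terms = [first]
    for p, suffix in deltas:
        terms.append(terms[-1][:p] + suffix)
    return terms

terms = [b"apple", b"applet", b"apply"]  # sorted, as in a terms dictionary
first, deltas = encode_block(terms)

# The suffixes are concatenated and compressed together. The first suffix in
# this buffer has nothing before it unless the first term serves as a dictionary.
suffix_bytes = b"".join(suffix for _, suffix in deltas)
```

Keeping the first term whole is what allows a binary search across blocks
without decompressing anything, and the same uncompressed bytes are what the
proposal reuses as a dictionary for the suffix buffer.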



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

