[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264171#comment-17264171
 ] 

Adrien Grand commented on LUCENE-9663:
--------------------------------------

+1 to add lightweight compression to doc-value terms dictionaries. I've seen 
users store things like unique URLs in sorted doc-value fields where 
compressing suffixes would have helped.

I agree with Jaison that the query impact should be negligible since faceting 
typically bottlenecks on reading ordinals, not terms dictionaries, though we 
should double check. :) Also +1 to test how much slower building an 
OrdinalMap gets with this change.

bq. replacing prefix-compression with LZ4

My intuition is that it would actually be better to do LZ4 in addition to 
prefix compression, like we do for the terms dictionary of the inverted index.
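
To make the layering concrete, here is a rough sketch of prefix-encoding a 
sorted block of terms and then running a general-purpose compressor over the 
result. It is only an illustration, not Lucene's actual on-disk format: the 
class and method names are made up, lengths are stored as single bytes, and 
java.util.zip.Deflater stands in for LZ4 because the JDK does not ship LZ4.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

// Hypothetical sketch: prefix compression first, block compression second.
public class PrefixThenCompress {

    // Number of leading bytes that a and b have in common.
    static int sharedPrefix(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        int i = 0;
        while (i < n && a[i] == b[i]) i++;
        return i;
    }

    // Encodes each term as (prefixLen, suffixLen, suffix bytes) relative to
    // the previous term. Lengths are limited to 255 to keep the sketch short.
    static byte[] prefixEncode(List<String> sortedTerms) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] prev = new byte[0];
        for (String term : sortedTerms) {
            byte[] cur = term.getBytes(StandardCharsets.UTF_8);
            int prefix = sharedPrefix(prev, cur);
            int suffix = cur.length - prefix;
            if (prefix > 255 || suffix > 255) {
                throw new IllegalArgumentException("term too long for this sketch");
            }
            out.write(prefix);
            out.write(suffix);
            out.write(cur, prefix, suffix);
            prev = cur;
        }
        return out.toByteArray();
    }

    // Inverse of prefixEncode, used to check that nothing is lost.
    static List<String> prefixDecode(byte[] data) {
        List<String> terms = new ArrayList<>();
        byte[] prev = new byte[0];
        int pos = 0;
        while (pos < data.length) {
            int prefix = data[pos++] & 0xFF;
            int suffix = data[pos++] & 0xFF;
            byte[] cur = new byte[prefix + suffix];
            System.arraycopy(prev, 0, cur, 0, prefix);
            System.arraycopy(data, pos, cur, prefix, suffix);
            pos += suffix;
            terms.add(new String(cur, StandardCharsets.UTF_8));
            prev = cur;
        }
        return terms;
    }

    // Second stage: compress the prefix-encoded block (Deflater, not LZ4).
    static byte[] compressBlock(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up URL-like terms, already in sorted order.
        List<String> terms = new ArrayList<>();
        for (int i = 100; i < 164; i++) {
            terms.add("https://example.com/products/item-" + i + "/details.html");
        }
        int raw = 0;
        for (String t : terms) raw += t.getBytes(StandardCharsets.UTF_8).length;
        byte[] prefixed = prefixEncode(terms);
        byte[] both = compressBlock(prefixed);
        System.out.println("raw bytes:            " + raw);
        System.out.println("prefix-encoded bytes: " + prefixed.length);
        System.out.println("prefix + deflate:     " + both.length);
    }
}
```

Prefix encoding only removes what consecutive terms share at the front; 
repeated suffixes such as "/details.html" above are left for the second stage 
to pick up.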

> Adding compression to terms dict from SortedSet/Sorted DocValues
> ----------------------------------------------------------------
>
>                 Key: LUCENE-9663
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9663
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Jaison.Bi
>            Priority: Trivial
>
> Elasticsearch keyword field uses SortedSet DocValues. In our applications, 
> “keyword” is the most frequently used field type.
>  LUCENE-7081 has done prefix-compression for docvalues terms dict. We can do 
> better by replacing prefix-compression with LZ4. In one of our applications, 
> the dvd files were ~41% smaller with this change (from 1.95 GB to 1.15 GB).
>  I've done simple tests based on real application data, comparing the 
> write/merge time cost and the on-disk *.dvd file size (after merging into 1 
> segment).
> || ||Before||After||
> |Write time cost(ms)|591972|618200|
> |Merge time cost(ms)|270661|294663|
> |*.dvd file size(GB)|1.95|1.15|
> This feature is only for high-cardinality fields. 
>  I'm running benchmarks based on luceneutil and will attach the report and 
> patch after the test.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

