[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

Jaison.Bi (Jira) Sun, 17 Jan 2021 18:52:13 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17266976#comment-17266976
 ]


Jaison.Bi commented on LUCENE-9663:
-----------------------------------

Thanks for the comment, [~mikemccand]

I added a benchmark to compare the cost of building the OrdinalMap before and after the change.
 It still uses the data mentioned in the previous comment. Each index contains 4 
segments. 
 Index directory size:
||Before||After||
|6.23 GB|5.38 GB|

(I did not count the dvd file size separately, since the segments use compound files.)

See the results below:
||Benchmark||Mode||Cnt||Score||Error||Units||
|BuildOrdinalMapBenchmark.buildOrdinalMap_extend_After|avgt|15|2120.204|± 111.956|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_extend_Before|avgt|15|1217.172|± 57.555|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_host_After|avgt|15|4.775|± 0.260|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_host_Before|avgt|15|4.667|± 0.154|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_obj_After|avgt|15|670.785|± 52.170|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_obj_Before|avgt|15|557.300|± 80.592|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_reqid_After|avgt|15|876.092|± 112.798|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_reqid_Before|avgt|15|515.775|± 61.233|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_uploadtime_After|avgt|15|167.986|± 5.600|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_uploadtime_Before|avgt|15|162.752|± 1.934|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_url_After|avgt|15|667.657|± 18.655|ms/op|
|BuildOrdinalMapBenchmark.buildOrdinalMap_url_Before|avgt|15|524.013|± 27.244|ms/op|
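
To make the table easier to compare field by field, the After/Before ratio can be computed directly from the scores. The following is a small Python sketch using only the numbers in the table. Treating the Error column as a symmetric interval around the score is an assumption about how to read it, and an overlap between the two intervals is only a rough hint that a difference may be within noise.

```python
# Scores and errors in ms/op, copied from the benchmark table: field -> (score, error).
before = {
    "extend": (1217.172, 57.555),
    "host": (4.667, 0.154),
    "obj": (557.300, 80.592),
    "reqid": (515.775, 61.233),
    "uploadtime": (162.752, 1.934),
    "url": (524.013, 27.244),
}
after = {
    "extend": (2120.204, 111.956),
    "host": (4.775, 0.260),
    "obj": (670.785, 52.170),
    "reqid": (876.092, 112.798),
    "uploadtime": (167.986, 5.600),
    "url": (667.657, 18.655),
}


def slowdown(field):
    """Return the After/Before ratio of the mean scores for one field."""
    return after[field][0] / before[field][0]


def intervals_overlap(field):
    """Return True if [score - error, score + error] overlaps for Before and After.

    Assumption: the Error column is read as a symmetric interval around the score.
    """
    b_score, b_err = before[field]
    a_score, a_err = after[field]
    return (a_score - a_err) <= (b_score + b_err) and (b_score - b_err) <= (a_score + a_err)


for field in before:
    print(f"{field:<11} ratio={slowdown(field):.2f} intervals_overlap={intervals_overlap(field)}")
```

On these numbers, extend and reqid show the largest ratios (roughly 1.7x), while host and uploadtime stay within a few percent.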

 

> Adding compression to terms dict from SortedSet/Sorted DocValues
> ----------------------------------------------------------------
>
>                 Key: LUCENE-9663
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9663
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Jaison.Bi
>            Priority: Trivial
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Elasticsearch keyword field uses SortedSet DocValues. In our applications, 
> “keyword” is the most frequently used field type.
>  LUCENE-7081 added prefix compression for the docvalues terms dict. We can do 
> better by replacing prefix compression with LZ4. In one of our applications, 
> the dvd files were ~41% smaller with this change (from 1.95 GB to 1.15 GB).
>  I've done simple tests based on the real application data, comparing the 
> write/merge time cost and the on-disk *.dvd file size (after merging into 1 
> segment).
> || ||Before||After||
> |Write time cost(ms)|591972|618200|
> |Merge time cost(ms)|270661|294663|
> |*.dvd file size(GB)|1.95|1.15|
> This feature only applies to high-cardinality fields. 
>  I'm running a benchmark based on luceneutil and will attach the report and 
> patch once it finishes.
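
For readers unfamiliar with the prefix compression mentioned in the description, here is a minimal Python sketch of generic front coding over a sorted term list. It illustrates the general idea only and is not Lucene's on-disk format or the code from LUCENE-7081. The function names and the sample terms are hypothetical.

```python
def front_encode(sorted_terms):
    """Encode each term as (shared_prefix_len, suffix) relative to the previous term."""
    encoded = []
    prev = b""
    for term in sorted_terms:
        # Count how many leading bytes this term shares with the previous one.
        shared = 0
        limit = min(len(prev), len(term))
        while shared < limit and prev[shared] == term[shared]:
            shared += 1
        encoded.append((shared, term[shared:]))
        prev = term
    return encoded


def front_decode(encoded):
    """Rebuild the original terms from (shared_prefix_len, suffix) pairs."""
    terms = []
    prev = b""
    for shared, suffix in encoded:
        term = prev[:shared] + suffix
        terms.append(term)
        prev = term
    return terms


# Hypothetical sample terms, sorted as a terms dict would store them.
terms = sorted([b"/api/v1/users", b"/api/v1/users/42", b"/api/v2/items", b"/index.html"])
encoded = front_encode(terms)
for shared, suffix in encoded:
    print(shared, suffix)
```

Front coding only removes bytes shared at the start of neighbouring terms. A general-purpose compressor such as LZ4 can also pick up repeated byte sequences elsewhere in a term, which is one plausible reason it could shrink the terms dict further.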



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

