[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

Jaison.Bi (Jira) Sun, 17 Jan 2021 07:56:05 -0800


    [ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17266811#comment-17266811 ]


Jaison.Bi commented on LUCENE-9663:
-----------------------------------

The luceneutil benchmark results (source: wikimedium10m) do not show an 
obvious performance reduction after this change:
||Task||QPS baseline||StdDev||QPS my_modified_version||StdDev||Pct diff||p-value||
|Fuzzy2|69.46|(13.6%)|64.58|(17.9%)|-7.0% ( -33% - 28%)|0.197|
|OrHighMed|53.91|(3.5%)|52.69|(4.0%)|-2.3% ( -9% - 5%)|0.078|
|OrHighHigh|26.47|(3.7%)|26.00|(2.8%)|-1.8% ( -7% - 4%)|0.112|
|Fuzzy1|77.94|(11.0%)|76.62|(11.2%)|-1.7% ( -21% - 23%)|0.656|
|Prefix3|91.45|(3.0%)|90.82|(4.3%)|-0.7% ( -7% - 6%)|0.588|
|LowTerm|1411.91|(5.7%)|1402.69|(5.0%)|-0.7% ( -10% - 10%)|0.722|
|MedPhrase|168.82|(3.7%)|168.37|(3.8%)|-0.3% ( -7% - 7%)|0.832|
|OrHighLow|544.05|(7.3%)|543.19|(8.3%)|-0.2% ( -14% - 16%)|0.953|
|LowSloppyPhrase|19.33|(2.6%)|19.37|(3.5%)|0.2% ( -5% - 6%)|0.858|
|HighSpanNear|3.20|(2.5%)|3.21|(4.2%)|0.2% ( -6% - 7%)|0.871|
|Wildcard|129.32|(6.0%)|129.58|(3.9%)|0.2% ( -9% - 10%)|0.910|
|PKLookup|202.45|(3.2%)|202.85|(3.3%)|0.2% ( -6% - 6%)|0.859|
|BrowseDayOfYearSSDVFacets|14.23|(2.2%)|14.26|(2.4%)|0.3% ( -4% - 4%)|0.749|
|MedSpanNear|184.22|(3.4%)|184.76|(4.6%)|0.3% ( -7% - 8%)|0.832|
|HighIntervalsOrdered|13.82|(2.0%)|13.89|(2.8%)|0.5% ( -4% - 5%)|0.519|
|HighTermTitleBDVSort|93.32|(12.6%)|93.87|(12.0%)|0.6% ( -21% - 28%)|0.889|
|HighTermDayOfYearSort|74.63|(10.7%)|75.08|(12.3%)|0.6% ( -20% - 26%)|0.878|
|MedSloppyPhrase|129.10|(2.5%)|129.89|(4.2%)|0.6% ( -5% - 7%)|0.611|
|HighPhrase|19.91|(3.1%)|20.03|(2.8%)|0.6% ( -5% - 6%)|0.552|
|HighSloppyPhrase|21.03|(2.1%)|21.16|(3.5%)|0.6% ( -4% - 6%)|0.524|
|Respell|52.62|(4.2%)|52.97|(2.6%)|0.7% ( -5% - 7%)|0.588|
|TermDTSort|240.48|(13.1%)|242.13|(12.7%)|0.7% ( -22% - 30%)|0.876|
|IntNRQ|113.26|(3.3%)|114.07|(3.3%)|0.7% ( -5% - 7%)|0.527|
|AndHighHigh|53.15|(3.8%)|53.55|(3.7%)|0.8% ( -6% - 8%)|0.553|
|LowSpanNear|22.72|(2.5%)|22.92|(2.8%)|0.8% ( -4% - 6%)|0.349|
|MedTerm|1383.09|(3.9%)|1399.20|(5.4%)|1.2% ( -7% - 10%)|0.474|
|BrowseDayOfYearTaxoFacets|3.09|(5.2%)|3.14|(4.5%)|1.4% ( -7% - 11%)|0.401|
|HighTermMonthSort|92.89|(16.9%)|94.23|(17.4%)|1.4% ( -28% - 42%)|0.807|
|AndHighMed|278.15|(4.2%)|282.18|(4.7%)|1.4% ( -7% - 10%)|0.345|
|BrowseDateTaxoFacets|3.09|(5.2%)|3.14|(4.4%)|1.6% ( -7% - 11%)|0.330|
|BrowseMonthTaxoFacets|3.39|(6.0%)|3.44|(5.1%)|1.6% ( -8% - 13%)|0.398|
|BrowseMonthSSDVFacets|15.74|(6.4%)|16.00|(3.3%)|1.7% ( -7% - 12%)|0.337|
|LowPhrase|319.40|(3.5%)|324.87|(5.1%)|1.7% ( -6% - 10%)|0.252|
|AndHighLow|730.59|(4.6%)|744.60|(4.8%)|1.9% ( -7% - 11%)|0.238|
|OrNotHighLow|660.02|(5.7%)|673.32|(3.9%)|2.0% ( -7% - 12%)|0.231|
|HighTerm|1289.67|(4.6%)|1316.15|(4.9%)|2.1% ( -7% - 12%)|0.210|
|OrHighNotMed|691.04|(7.0%)|711.12|(5.6%)|2.9% ( -9% - 16%)|0.182|
|OrHighNotHigh|610.79|(8.2%)|631.12|(5.8%)|3.3% ( -9% - 18%)|0.171|
|OrNotHighMed|637.03|(6.9%)|658.85|(7.0%)|3.4% ( -9% - 18%)|0.152|
|OrNotHighHigh|599.42|(5.9%)|620.44|(5.3%)|3.5% ( -7% - 15%)|0.070|
|OrHighNotLow|861.26|(6.1%)|912.07|(7.8%)|5.9% ( -7% - 21%)|0.014|

 

I also wrote a separate benchmark for faceting:
 * 25 SortedSet fields per document.
 * Some of the fields, with their cardinality and average value length, are listed below:

||Field Name||Cardinality||Avg Value Length||
|reqid|3772370|69|
|extend|3758007|343|
|url|3623677|61|
|obj|3599083|57|
|uploadtime|1064012|136|
|host|2418|12|

This benchmark focuses on the latency of building 
SortedSetDocValuesReaderState (which reads the TermsDict) and of getting the 
top 10 children. See the results below:
||Benchmark||Mode||Cnt||Score||Error||Units||
|SortedSetFacetBenchmark.testBuildReaderState_After|avgt|15|32772.487|± 926.056|ms/op|
|SortedSetFacetBenchmark.testBuildReaderState_Before|avgt|15|19462.099|± 906.832|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_extend_After|avgt|15|1575.330|± 22.725|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_extend_Before|avgt|15|1559.596|± 18.216|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_host_After|avgt|15|1599.762|± 81.167|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_host_Before|avgt|15|1573.225|± 25.173|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_obj_After|avgt|15|1578.812|± 19.121|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_obj_Before|avgt|15|1578.499|± 16.796|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_reqid_After|avgt|15|1575.300|± 13.651|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_reqid_Before|avgt|15|1562.115|± 27.098|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_uploadtime_After|avgt|15|1560.106|± 18.756|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_uploadtime_Before|avgt|15|1556.131|± 14.161|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_url_After|avgt|15|1568.535|± 23.545|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_url_Before|avgt|15|1554.675|± 23.721|ms/op|

So operations that read the TermsDict become noticeably slower (building the 
reader state took about 68% longer, 32772 ms/op versus 19462 ms/op), while 
queries based on ordinal values are not affected.
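
To put the numbers above in relative terms, here is a small sketch that computes the percentage change between a Before and an After score and checks whether the two reported error intervals overlap. The class and method names are hypothetical and are not part of the benchmark; only the scores and errors are taken from the results table, and the overlap check is a rough heuristic, not a significance test.

```java
class SlowdownCheck {
    /** Relative change of `after` versus `before`, in percent. */
    static double pctChange(double before, double after) {
        return (after - before) / before * 100.0;
    }

    /** True if [a - errA, a + errA] and [b - errB, b + errB] overlap. */
    static boolean intervalsOverlap(double a, double errA, double b, double errB) {
        return Math.abs(a - b) <= errA + errB;
    }

    public static void main(String[] args) {
        // testBuildReaderState: Before = 19462.099 ± 906.832, After = 32772.487 ± 926.056 (ms/op)
        System.out.println("build reader state: "
            + pctChange(19462.099, 32772.487) + " %, intervals overlap: "
            + intervalsOverlap(19462.099, 906.832, 32772.487, 926.056));

        // testGetTop10Results_extend: Before = 1559.596 ± 18.216, After = 1575.330 ± 22.725 (ms/op)
        System.out.println("top 10 (extend): "
            + pctChange(1559.596, 1575.330) + " %, intervals overlap: "
            + intervalsOverlap(1559.596, 18.216, 1575.330, 22.725));
    }
}
```

For building the reader state the change is roughly +68% and the intervals do not overlap; for the `extend` top-10 query the change is about +1% and the intervals overlap.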

> Adding compression to terms dict from SortedSet/Sorted DocValues
> ----------------------------------------------------------------
>
>                 Key: LUCENE-9663
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9663
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Jaison.Bi
>            Priority: Trivial
>
> The Elasticsearch keyword field uses SortedSet DocValues. In our applications, 
> “keyword” is the most frequently used field type.
>  LUCENE-7081 added prefix compression for the docvalues terms dict. We can do 
> better by replacing prefix compression with LZ4. In one of our applications, 
> the dvd files were ~41% smaller with this change (from 1.95 GB to 1.15 GB).
>  I've done simple tests based on real application data, comparing the 
> write/merge time cost and the on-disk *.dvd file size (after merging into 1 
> segment).
> || ||Before||After||
> |Write time cost (ms)|591972|618200|
> |Merge time cost (ms)|270661|294663|
> |*.dvd file size (GB)|1.95|1.15|
> This feature only targets high-cardinality fields. 
>  I'm running a benchmark based on luceneutil and will attach the report and 
> patch once it finishes.
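
For readers unfamiliar with the prefix compression mentioned in the description, here is a minimal sketch of the general idea: each term in a sorted list is stored as the length of the prefix it shares with the previous term plus the remaining suffix. This is an illustration only and not Lucene's on-disk format. It works on Java strings rather than bytes, ignores blocking, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PrefixCompressionSketch {
    /** One encoded term: shared-prefix length with the previous term, plus the rest. */
    static final class Entry {
        final int prefixLen;
        final String suffix;

        Entry(int prefixLen, String suffix) {
            this.prefixLen = prefixLen;
            this.suffix = suffix;
        }
    }

    /** Encodes sorted terms as (shared prefix length, suffix) pairs. */
    static List<Entry> encode(List<String> sortedTerms) {
        List<Entry> out = new ArrayList<>();
        String prev = "";
        for (String term : sortedTerms) {
            int max = Math.min(prev.length(), term.length());
            int p = 0;
            while (p < max && prev.charAt(p) == term.charAt(p)) {
                p++;
            }
            out.add(new Entry(p, term.substring(p)));
            prev = term;
        }
        return out;
    }

    /** Rebuilds the original terms by replaying the entries in order. */
    static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Entry e : entries) {
            String term = prev.substring(0, e.prefixLen) + e.suffix;
            out.add(term);
            prev = term;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
            "host-a.example.com", "host-b.example.com", "host-b.example.org");
        for (Entry e : encode(terms)) {
            System.out.println(e.prefixLen + " + \"" + e.suffix + "\"");
        }
    }
}
```

In this example the second term shares only `host-` with the first, so the repeated `.example.com` tail is stored again. A general-purpose compressor such as LZ4 can also pick up repeats like that, which is one plausible reason for the smaller *.dvd files reported above.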



