[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17266811#comment-17266811 ]
Jaison.Bi commented on LUCENE-9663:
-----------------------------------

The benchmark results from luceneutil (source: wikimedium10m) do not show an obvious QPS reduction after this change:

||Task||QPS baseline||StdDev||QPS my_modified_version||StdDev||Pct diff||p-value||
|Fuzzy2|69.46|(13.6%)|64.58|(17.9%)|-7.0% ( -33% - 28%)|0.197|
|OrHighMed|53.91|(3.5%)|52.69|(4.0%)|-2.3% ( -9% - 5%)|0.078|
|OrHighHigh|26.47|(3.7%)|26.00|(2.8%)|-1.8% ( -7% - 4%)|0.112|
|Fuzzy1|77.94|(11.0%)|76.62|(11.2%)|-1.7% ( -21% - 23%)|0.656|
|Prefix3|91.45|(3.0%)|90.82|(4.3%)|-0.7% ( -7% - 6%)|0.588|
|LowTerm|1411.91|(5.7%)|1402.69|(5.0%)|-0.7% ( -10% - 10%)|0.722|
|MedPhrase|168.82|(3.7%)|168.37|(3.8%)|-0.3% ( -7% - 7%)|0.832|
|OrHighLow|544.05|(7.3%)|543.19|(8.3%)|-0.2% ( -14% - 16%)|0.953|
|LowSloppyPhrase|19.33|(2.6%)|19.37|(3.5%)|0.2% ( -5% - 6%)|0.858|
|HighSpanNear|3.20|(2.5%)|3.21|(4.2%)|0.2% ( -6% - 7%)|0.871|
|Wildcard|129.32|(6.0%)|129.58|(3.9%)|0.2% ( -9% - 10%)|0.910|
|PKLookup|202.45|(3.2%)|202.85|(3.3%)|0.2% ( -6% - 6%)|0.859|
|BrowseDayOfYearSSDVFacets|14.23|(2.2%)|14.26|(2.4%)|0.3% ( -4% - 4%)|0.749|
|MedSpanNear|184.22|(3.4%)|184.76|(4.6%)|0.3% ( -7% - 8%)|0.832|
|HighIntervalsOrdered|13.82|(2.0%)|13.89|(2.8%)|0.5% ( -4% - 5%)|0.519|
|HighTermTitleBDVSort|93.32|(12.6%)|93.87|(12.0%)|0.6% ( -21% - 28%)|0.889|
|HighTermDayOfYearSort|74.63|(10.7%)|75.08|(12.3%)|0.6% ( -20% - 26%)|0.878|
|MedSloppyPhrase|129.10|(2.5%)|129.89|(4.2%)|0.6% ( -5% - 7%)|0.611|
|HighPhrase|19.91|(3.1%)|20.03|(2.8%)|0.6% ( -5% - 6%)|0.552|
|HighSloppyPhrase|21.03|(2.1%)|21.16|(3.5%)|0.6% ( -4% - 6%)|0.524|
|Respell|52.62|(4.2%)|52.97|(2.6%)|0.7% ( -5% - 7%)|0.588|
|TermDTSort|240.48|(13.1%)|242.13|(12.7%)|0.7% ( -22% - 30%)|0.876|
|IntNRQ|113.26|(3.3%)|114.07|(3.3%)|0.7% ( -5% - 7%)|0.527|
|AndHighHigh|53.15|(3.8%)|53.55|(3.7%)|0.8% ( -6% - 8%)|0.553|
|LowSpanNear|22.72|(2.5%)|22.92|(2.8%)|0.8% ( -4% - 6%)|0.349|
|MedTerm|1383.09|(3.9%)|1399.20|(5.4%)|1.2% ( -7% - 10%)|0.474|
|BrowseDayOfYearTaxoFacets|3.09|(5.2%)|3.14|(4.5%)|1.4% ( -7% - 11%)|0.401|
|HighTermMonthSort|92.89|(16.9%)|94.23|(17.4%)|1.4% ( -28% - 42%)|0.807|
|AndHighMed|278.15|(4.2%)|282.18|(4.7%)|1.4% ( -7% - 10%)|0.345|
|BrowseDateTaxoFacets|3.09|(5.2%)|3.14|(4.4%)|1.6% ( -7% - 11%)|0.330|
|BrowseMonthTaxoFacets|3.39|(6.0%)|3.44|(5.1%)|1.6% ( -8% - 13%)|0.398|
|BrowseMonthSSDVFacets|15.74|(6.4%)|16.00|(3.3%)|1.7% ( -7% - 12%)|0.337|
|LowPhrase|319.40|(3.5%)|324.87|(5.1%)|1.7% ( -6% - 10%)|0.252|
|AndHighLow|730.59|(4.6%)|744.60|(4.8%)|1.9% ( -7% - 11%)|0.238|
|OrNotHighLow|660.02|(5.7%)|673.32|(3.9%)|2.0% ( -7% - 12%)|0.231|
|HighTerm|1289.67|(4.6%)|1316.15|(4.9%)|2.1% ( -7% - 12%)|0.210|
|OrHighNotMed|691.04|(7.0%)|711.12|(5.6%)|2.9% ( -9% - 16%)|0.182|
|OrHighNotHigh|610.79|(8.2%)|631.12|(5.8%)|3.3% ( -9% - 18%)|0.171|
|OrNotHighMed|637.03|(6.9%)|658.85|(7.0%)|3.4% ( -9% - 18%)|0.152|
|OrNotHighHigh|599.42|(5.9%)|620.44|(5.3%)|3.5% ( -7% - 15%)|0.070|
|OrHighNotLow|861.26|(6.1%)|912.07|(7.8%)|5.9% ( -7% - 21%)|0.014|

I also wrote a separate benchmark for faceting:
* 25 SortedSet fields per document.
* Some of the fields, most of them high-cardinality, with their average value lengths:

||Field Name||Cardinality||Avg Value Length||
|reqid|3772370|69|
|extend|3758007|343|
|url|3623677|61|
|obj|3599083|57|
|uploadtime|1064012|136|
|host|2418|12|

This benchmark focuses on the latency of building SortedSetDocValuesReaderState (which reads the TermsDict) and of getting the top 10 children.
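To make the difference between the two measured operations concrete, here is a toy model in plain Java. It is not Lucene code: the class `OrdinalFacetSketch` and all of its methods are hypothetical names, and the terms dict is assumed to be nothing more than a sorted `String[]`. Under that assumption, counting touches only ordinals and the dictionary is consulted once per returned child, while anything that walks every term pays the lookup (and, with compression, decode) cost for each one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model (not Lucene code): a sorted terms dict plus per-document ordinals. */
final class OrdinalFacetSketch {

    /** Sorted, de-duplicated terms; the array index of a term is its ordinal. */
    static String[] buildTermsDict(List<List<String>> docs) {
        return docs.stream().flatMap(List::stream).distinct().sorted().toArray(String[]::new);
    }

    /** Replaces every document value by its ordinal in the sorted dict. */
    static int[][] toOrdinals(List<List<String>> docs, String[] termsDict) {
        int[][] docOrds = new int[docs.size()][];
        for (int d = 0; d < docs.size(); d++) {
            List<String> values = docs.get(d);
            docOrds[d] = new int[values.size()];
            for (int i = 0; i < values.size(); i++) {
                docOrds[d][i] = Arrays.binarySearch(termsDict, values.get(i));
            }
        }
        return docOrds;
    }

    /** Counts hits per ordinal. Only ints are read here, never term bytes. */
    static int[] countByOrdinal(int[][] docOrds, int cardinality) {
        int[] counts = new int[cardinality];
        for (int[] ords : docOrds) {
            for (int ord : ords) {
                counts[ord]++;
            }
        }
        return counts;
    }

    /** Returns the top-n "term=count" entries; the dict is read at most n times. */
    static List<String> topChildren(String[] termsDict, int[] counts, int n) {
        Integer[] ords = new Integer[counts.length];
        for (int i = 0; i < ords.length; i++) {
            ords[i] = i;
        }
        // Highest count first; ties broken by ascending ordinal.
        Arrays.sort(ords, (a, b) -> counts[b] != counts[a]
                ? Integer.compare(counts[b], counts[a])
                : Integer.compare(a, b));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, ords.length); i++) {
            top.add(termsDict[ords[i]] + "=" + counts[ords[i]]);
        }
        return top;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("b.example", "a.example"),
                List.of("b.example"),
                List.of("c.example", "b.example", "a.example"));
        String[] dict = buildTermsDict(docs);
        int[] counts = countByOrdinal(toOrdinals(docs, dict), dict.length);
        System.out.println(topChildren(dict, counts, 2));
    }
}
```

In this sketch `topChildren` resolves only as many terms as it returns, so a slower dictionary would barely show up in its latency, whereas a step that iterates over all terms scales with the cardinality of the field.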
See the results below:

||Benchmark||Mode||Cnt||Score||Error||Units||
|SortedSetFacetBenchmark.testBuildReaderState_After|avgt|15|32772.487|± 926.056|ms/op|
|SortedSetFacetBenchmark.testBuildReaderState_Before|avgt|15|19462.099|± 906.832|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_extend_After|avgt|15|1575.330|± 22.725|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_extend_Before|avgt|15|1559.596|± 18.216|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_host_After|avgt|15|1599.762|± 81.167|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_host_Before|avgt|15|1573.225|± 25.173|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_obj_After|avgt|15|1578.812|± 19.121|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_obj_Before|avgt|15|1578.499|± 16.796|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_reqid_After|avgt|15|1575.300|± 13.651|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_reqid_Before|avgt|15|1562.115|± 27.098|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_uploadtime_After|avgt|15|1560.106|± 18.756|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_uploadtime_Before|avgt|15|1556.131|± 14.161|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_url_After|avgt|15|1568.535|± 23.545|ms/op|
|SortedSetFacetBenchmark.testGetTop10Results_url_Before|avgt|15|1554.675|± 23.721|ms/op|

So operations that read the TermsDict become noticeably slower (building the reader state goes from about 19.5 s to about 32.8 s per operation, roughly 68% slower), while queries based on ordinal values are not affected.

> Adding compression to terms dict from SortedSet/Sorted DocValues
> ----------------------------------------------------------------
>
> Key: LUCENE-9663
> URL: https://issues.apache.org/jira/browse/LUCENE-9663
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Reporter: Jaison.Bi
> Priority: Trivial
>
> Elasticsearch keyword field uses SortedSet DocValues. In our applications,
> “keyword” is the most frequently used field type.
> LUCENE-7081 has done prefix-compression for docvalues terms dict.
> We can do better by replacing prefix-compression with LZ4. In one of our
> applications, the dvd files were ~41% smaller with this change (from 1.95 GB
> to 1.15 GB).
> I've done simple tests based on the real application data, comparing the
> write/merge time cost, and the on-disk *.dvd file size (after merge into 1
> segment).
> || ||Before||After||
> |Write time cost(ms)|591972|618200|
> |Merge time cost(ms)|270661|294663|
> |*.dvd file size(GB)|1.95|1.15|
> This feature is only for the high-cardinality fields.
> I'm doing the benchmark test based on luceneutil. Will attach the report and
> patch after the test.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)