[ https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405002#comment-17405002 ]
weizijun commented on LUCENE-10033:
-----------------------------------

Hi, [~jpountz], [~gsmiller],

I ran luceneutil:
python3 src/python/localrun.py -source wikimedium10k

I changed some code in localrun.py:
{code:python}
import competition
import sys

# simple example that runs benchmark with WIKI_MEDIUM source and tasks files
# Baseline here is ../lucene_baseline versus ../lucene_candidate
if __name__ == '__main__':
    sourceData = competition.sourceData()
    comp = competition.Competition()

    # index for the baseline checkout
    index = comp.newIndex('lucene_baseline', sourceData,
                          facets = (('taxonomy:Date', 'Date'),
                                    ('taxonomy:Month', 'Month'),
                                    ('taxonomy:DayOfYear', 'DayOfYear'),
                                    ('sortedset:Month', 'Month'),
                                    ('sortedset:DayOfYear', 'DayOfYear')))

    # separate index for the candidate checkout, with the same facets
    index_candidate = comp.newIndex('lucene_candidate', sourceData,
                                    facets = (('taxonomy:Date', 'Date'),
                                              ('taxonomy:Month', 'Month'),
                                              ('taxonomy:DayOfYear', 'DayOfYear'),
                                              ('sortedset:Month', 'Month'),
                                              ('sortedset:DayOfYear', 'DayOfYear')))

    # Warning -- Do not break the order of arguments
    # TODO -- Fix the following by using argparser
    if len(sys.argv) > 3 and sys.argv[3] == '-concurrentSearches':
        concurrentSearches = True
    else:
        concurrentSearches = False

    # create a competitor named baseline with sources in the ../lucene_baseline folder
    comp.competitor('baseline', 'lucene_baseline',
                    index = index, concurrentSearches = concurrentSearches)

    # create a competitor named my_modified_version with sources in the ../lucene_candidate folder
    # it searches its own index (index_candidate) rather than reusing the baseline index
    comp.competitor('my_modified_version', 'lucene_candidate',
                    index = index_candidate, concurrentSearches = concurrentSearches)

    # start the benchmark - this can take long depending on your index and machines
    comp.benchmark("baseline_vs_patch")
{code}
The baseline is Lucene's master branch.
The candidate is the branch from [PR #1|https://github.com/jpountz/lucene/pull/1].

Here is the result:
{noformat}
Task                       QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff              p-value
BrowseMonthSSDVFacets           1958.41   (3.1%)                  1538.44   (1.7%)  -21.4% (-25% - -17%)  0.000
BrowseDayOfYearSSDVFacets       1652.84   (2.9%)                  1413.20   (1.9%)  -14.5% (-18% - -10%)  0.000
IntNRQ                          1510.90   (4.4%)                  1481.97   (8.3%)   -1.9% (-13% -  11%)  0.361
HighIntervalsOrdered             557.92  (13.2%)                   547.32  (12.6%)   -1.9% (-24% -  27%)  0.642
MedIntervalsOrdered              872.92  (12.3%)                   858.58  (11.6%)   -1.6% (-22% -  25%)  0.664
LowIntervalsOrdered             1121.80   (6.3%)                  1107.10   (6.7%)   -1.3% (-13% -  12%)  0.526
MedPhrase                        530.72   (5.8%)                   524.26   (5.8%)   -1.2% (-12% -  11%)  0.507
MedSpanNear                      723.07   (3.8%)                   714.34   (3.9%)   -1.2% ( -8% -   6%)  0.324
LowSloppyPhrase                  942.46   (2.9%)                   936.04   (3.4%)   -0.7% ( -6% -   5%)  0.497
LowPhrase                       1131.13   (4.0%)                  1128.82   (3.1%)   -0.2% ( -7% -   7%)  0.857
OrHighMed                        655.21  (13.7%)                   655.99  (12.2%)    0.1% (-22% -  30%)  0.977
PKLookup                         229.67   (1.6%)                   230.10   (2.0%)    0.2% ( -3% -   3%)  0.754
OrHighLow                        634.01  (10.4%)                   635.60   (5.9%)    0.3% (-14% -  18%)  0.925
HighTerm                        3600.11   (5.8%)                  3611.93   (4.3%)    0.3% ( -9% -  11%)  0.839
HighSloppyPhrase                 367.37   (5.1%)                   368.73   (5.8%)    0.4% (-10% -  11%)  0.832
HighSpanNear                     421.73   (6.5%)                   423.96   (5.8%)    0.5% (-11% -  13%)  0.787
HighTermDayOfYearSort           2533.62   (7.7%)                  2549.91   (7.3%)    0.6% (-13% -  16%)  0.786
LowSpanNear                      497.84   (5.5%)                   502.07   (3.3%)    0.8% ( -7% -  10%)  0.553
Respell                          266.07  (12.8%)                   268.61  (12.2%)    1.0% (-21% -  29%)  0.809
HighPhrase                       622.36   (6.2%)                   629.01   (7.7%)    1.1% (-12% -  15%)  0.629
AndHighMed                       854.35   (5.2%)                   865.51   (3.7%)    1.3% ( -7% -  10%)  0.360
BrowseMonthTaxoFacets           3057.03   (5.9%)                  3097.61   (4.9%)    1.3% ( -8% -  12%)  0.436
BrowseDayOfYearTaxoFacets       2399.39   (5.0%)                  2432.18   (4.0%)    1.4% ( -7% -  10%)  0.336
HighTermMonthSort               2564.47   (6.1%)                  2607.36   (4.7%)    1.7% ( -8% -  13%)  0.330
Fuzzy1                           306.10   (7.1%)                   311.26   (7.0%)    1.7% (-11% -  17%)  0.451
LowTerm                         3912.29   (4.3%)                  3979.32   (6.2%)    1.7% ( -8% -  12%)  0.309
OrHighHigh                       480.12   (7.7%)                   488.87   (7.5%)    1.8% (-12% -  18%)  0.447
Prefix3                          471.26  (15.0%)                   480.73  (15.7%)    2.0% (-24% -  38%)  0.679
BrowseDateTaxoFacets            2721.15   (4.8%)                  2777.53   (4.5%)    2.1% ( -6% -  12%)  0.163
AndHighHigh                     1023.76   (7.7%)                  1045.20   (7.3%)    2.1% (-11% -  18%)  0.377
MedTerm                         3813.24   (5.5%)                  3898.93   (5.5%)    2.2% ( -8% -  14%)  0.198
Fuzzy2                           102.35  (12.7%)                   104.67  (15.1%)    2.3% (-22% -  34%)  0.608
AndHighLow                      3004.48   (5.9%)                  3073.96   (6.7%)    2.3% ( -9% -  15%)  0.246
MedSloppyPhrase                  591.31   (4.2%)                   605.55   (3.6%)    2.4% ( -5% -  10%)  0.050
Wildcard                         544.95  (12.9%)                   577.08   (7.2%)    5.9% (-12% -  29%)  0.074
{noformat}
The complete results are in the attachment: [^benchmark]

> Encode doc values in smaller blocks of values, like postings
> ------------------------------------------------------------
>
>                 Key: LUCENE-10033
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10033
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: benchmark
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> This is a follow-up to the discussion on this thread:
> https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.
> Our current approach for doc values uses large blocks of 16k values where
> values can be decompressed independently, using DirectWriter/DirectReader.
> This is a bit inefficient in some cases, e.g. a single outlier can grow the
> number of bits per value for the entire block, we can't easily use run-length
> compression, etc. Plus, it encourages using a different sub-class for every
> compression technique, which puts pressure on the JVM.
> We'd like to move to an approach that would be more similar to postings with
> smaller blocks (e.g. 128 values) whose values get all decompressed at once
> (using SIMD instructions), with skip data within blocks in order to
> efficiently skip to arbitrary doc IDs (or maybe still use jump tables as
> today's doc values, and as discussed here for postings:
> https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).
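
To make the outlier argument in the issue description concrete, here is a minimal Python sketch. It assumes a simplified cost model in which every value in a block is stored with the bit width of that block's largest value, and it ignores per-block headers and skip data. The function names and the sample data are hypothetical. This is not Lucene's DirectWriter, only an illustration of why smaller blocks limit the cost of one large value.

```python
def bits_required(value: int) -> int:
    """Bits needed to store a non-negative integer (at least 1)."""
    return max(1, value.bit_length())


def packed_bits(values, block_size):
    """Total payload bits when each block uses the width of its largest value.

    Assumption: no per-block header or skip data is counted.
    """
    total = 0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        total += len(block) * bits_required(max(block))
    return total


# Hypothetical data: 16,384 small values (each fits in 4 bits)
# plus a single outlier that needs 32 bits.
values = [i % 16 for i in range(16384)]
values[5000] = 2**32 - 1

large_block_bits = packed_bits(values, 16384)  # one block of 16k values
small_block_bits = packed_bits(values, 128)    # 128 blocks of 128 values

print("16k-value block:  ", large_block_bits, "bits")
print("128-value blocks: ", small_block_bits, "bits")
```

With this made-up data, the single 16k block spends 32 bits on every value, while with 128-value blocks only the one block that contains the outlier does. A real encoder would also pay some per-block overhead, which this sketch leaves out.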