[jira] [Commented] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings

weizijun (Jira) Wed, 25 Aug 2021 23:55:07 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405002#comment-17405002
 ]


weizijun commented on LUCENE-10033:
-----------------------------------

Hi [~jpountz], [~gsmiller], I ran luceneutil with: python3 src/python/localrun.py 
-source wikimedium10k. I changed some code in localrun.py:
{code:python}
import competition
import sys

# simple example that runs benchmark with WIKI_MEDIUM source and tasks files
# Baseline here is ../lucene_baseline versus ../lucene_candidate
if __name__ == '__main__':
  sourceData = competition.sourceData()
  comp =  competition.Competition()

  index = comp.newIndex('lucene_baseline', sourceData,
                        facets = (('taxonomy:Date', 'Date'),
                                  ('taxonomy:Month', 'Month'),
                                  ('taxonomy:DayOfYear', 'DayOfYear'),
                                  ('sortedset:Month', 'Month'),
                                  ('sortedset:DayOfYear', 'DayOfYear')))

  index_candidate = comp.newIndex('lucene_candidate', sourceData,
                        facets = (('taxonomy:Date', 'Date'),
                                  ('taxonomy:Month', 'Month'),
                                  ('taxonomy:DayOfYear', 'DayOfYear'),
                                  ('sortedset:Month', 'Month'),
                                  ('sortedset:DayOfYear', 'DayOfYear')))

  #Warning -- Do not break the order of arguments
  #TODO -- Fix the following by using argparser
  if len(sys.argv) > 3 and sys.argv[3] == '-concurrentSearches':
    concurrentSearches = True
  else:
    concurrentSearches = False

  # create a competitor named baseline with sources in the ../lucene_baseline folder
  comp.competitor('baseline', 'lucene_baseline',
                  index = index, concurrentSearches = concurrentSearches)

  # create a competitor named my_modified_version with sources in the
  # ../lucene_candidate folder
  # note that this competitor is given its own index (index_candidate)
  # instead of reusing the index from the base competitor
  comp.competitor('my_modified_version', 'lucene_candidate',
                  index = index_candidate, concurrentSearches = concurrentSearches)

  # start the benchmark - this can take long depending on your index and machines
  comp.benchmark("baseline_vs_patch")
{code}
The baseline is Lucene's master branch. The candidate is the branch from [PR #1|https://github.com/jpountz/lucene/pull/1].
Here are the results:
{noformat}
                     Task  QPS baseline    StdDev  QPS my_modified_version    StdDev                 Pct diff p-value
    BrowseMonthSSDVFacets       1958.41    (3.1%)                  1538.44    (1.7%)   -21.4% ( -25% -  -17%)   0.000
BrowseDayOfYearSSDVFacets       1652.84    (2.9%)                  1413.20    (1.9%)   -14.5% ( -18% -  -10%)   0.000
                   IntNRQ       1510.90    (4.4%)                  1481.97    (8.3%)    -1.9% ( -13% -   11%)   0.361
     HighIntervalsOrdered        557.92   (13.2%)                   547.32   (12.6%)    -1.9% ( -24% -   27%)   0.642
      MedIntervalsOrdered        872.92   (12.3%)                   858.58   (11.6%)    -1.6% ( -22% -   25%)   0.664
      LowIntervalsOrdered       1121.80    (6.3%)                  1107.10    (6.7%)    -1.3% ( -13% -   12%)   0.526
                MedPhrase        530.72    (5.8%)                   524.26    (5.8%)    -1.2% ( -12% -   11%)   0.507
              MedSpanNear        723.07    (3.8%)                   714.34    (3.9%)    -1.2% (  -8% -    6%)   0.324
          LowSloppyPhrase        942.46    (2.9%)                   936.04    (3.4%)    -0.7% (  -6% -    5%)   0.497
                LowPhrase       1131.13    (4.0%)                  1128.82    (3.1%)    -0.2% (  -7% -    7%)   0.857
                OrHighMed        655.21   (13.7%)                   655.99   (12.2%)     0.1% ( -22% -   30%)   0.977
                 PKLookup        229.67    (1.6%)                   230.10    (2.0%)     0.2% (  -3% -    3%)   0.754
                OrHighLow        634.01   (10.4%)                   635.60    (5.9%)     0.3% ( -14% -   18%)   0.925
                 HighTerm       3600.11    (5.8%)                  3611.93    (4.3%)     0.3% (  -9% -   11%)   0.839
         HighSloppyPhrase        367.37    (5.1%)                   368.73    (5.8%)     0.4% ( -10% -   11%)   0.832
             HighSpanNear        421.73    (6.5%)                   423.96    (5.8%)     0.5% ( -11% -   13%)   0.787
    HighTermDayOfYearSort       2533.62    (7.7%)                  2549.91    (7.3%)     0.6% ( -13% -   16%)   0.786
              LowSpanNear        497.84    (5.5%)                   502.07    (3.3%)     0.8% (  -7% -   10%)   0.553
                  Respell        266.07   (12.8%)                   268.61   (12.2%)     1.0% ( -21% -   29%)   0.809
               HighPhrase        622.36    (6.2%)                   629.01    (7.7%)     1.1% ( -12% -   15%)   0.629
               AndHighMed        854.35    (5.2%)                   865.51    (3.7%)     1.3% (  -7% -   10%)   0.360
    BrowseMonthTaxoFacets       3057.03    (5.9%)                  3097.61    (4.9%)     1.3% (  -8% -   12%)   0.436
BrowseDayOfYearTaxoFacets       2399.39    (5.0%)                  2432.18    (4.0%)     1.4% (  -7% -   10%)   0.336
        HighTermMonthSort       2564.47    (6.1%)                  2607.36    (4.7%)     1.7% (  -8% -   13%)   0.330
                   Fuzzy1        306.10    (7.1%)                   311.26    (7.0%)     1.7% ( -11% -   17%)   0.451
                  LowTerm       3912.29    (4.3%)                  3979.32    (6.2%)     1.7% (  -8% -   12%)   0.309
               OrHighHigh        480.12    (7.7%)                   488.87    (7.5%)     1.8% ( -12% -   18%)   0.447
                  Prefix3        471.26   (15.0%)                   480.73   (15.7%)     2.0% ( -24% -   38%)   0.679
     BrowseDateTaxoFacets       2721.15    (4.8%)                  2777.53    (4.5%)     2.1% (  -6% -   12%)   0.163
              AndHighHigh       1023.76    (7.7%)                  1045.20    (7.3%)     2.1% ( -11% -   18%)   0.377
                  MedTerm       3813.24    (5.5%)                  3898.93    (5.5%)     2.2% (  -8% -   14%)   0.198
                   Fuzzy2        102.35   (12.7%)                   104.67   (15.1%)     2.3% ( -22% -   34%)   0.608
               AndHighLow       3004.48    (5.9%)                  3073.96    (6.7%)     2.3% (  -9% -   15%)   0.246
          MedSloppyPhrase        591.31    (4.2%)                   605.55    (3.6%)     2.4% (  -5% -   10%)   0.050
                 Wildcard        544.95   (12.9%)                   577.08    (7.2%)     5.9% ( -12% -   29%)   0.074
{noformat}
The full results are in the attachment: [^benchmark]
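
For reference, the Pct diff values in the table appear consistent with a plain relative change in QPS between the two competitors. Below is a minimal sketch assuming that formula; it is not taken from luceneutil's source, and `pct_diff` is a hypothetical helper name. The QPS numbers are copied from three rows of the table.

```python
def pct_diff(baseline_qps, candidate_qps):
    """Relative QPS change of the candidate versus the baseline, in percent.

    Assumption: a simple (candidate - baseline) / baseline ratio. luceneutil
    may compute its column differently; this only matches the rows below.
    """
    return 100.0 * (candidate_qps - baseline_qps) / baseline_qps


# (QPS baseline, QPS my_modified_version) copied from the table
rows = {
    "BrowseMonthSSDVFacets": (1958.41, 1538.44),
    "BrowseDayOfYearSSDVFacets": (1652.84, 1413.20),
    "Wildcard": (544.95, 577.08),
}

diffs = {task: round(pct_diff(base, cand), 1) for task, (base, cand) in rows.items()}

for task, diff in diffs.items():
    print(f"{task:>25} {diff:6.1f}%")
```

Under this assumption, a negative value means the candidate is slower than the baseline on that task, which is the case for the two SSDV facet tasks.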

> Encode doc values in smaller blocks of values, like postings
> ------------------------------------------------------------
>
>                 Key: LUCENE-10033
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10033
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: benchmark
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> This is a follow-up to the discussion on this thread: 
> https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.
> Our current approach for doc values uses large blocks of 16k values where 
> values can be decompressed independently, using DirectWriter/DirectReader. 
> This is a bit inefficient in some cases, e.g. a single outlier can grow the 
> number of bits per value for the entire block, we can't easily use run-length 
> compression, etc. Plus, it encourages using a different sub-class for every 
> compression technique, which puts pressure on the JVM.
> We'd like to move to an approach that would be more similar to postings with 
> smaller blocks (e.g. 128 values) whose values get all decompressed at once 
> (using SIMD instructions), with skip data within blocks in order to 
> efficiently skip to arbitrary doc IDs (or maybe still use jump tables as 
> today's doc values, and as discussed here for postings: 
> https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).
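
To illustrate the outlier problem described in the issue, here is a small sketch that compares the packed size of the same values with one 16k block and with 128-value blocks. It assumes plain fixed-width bit packing per block and ignores per-block headers, skip data and jump tables. `bits_required` and `packed_size_bits` are hypothetical helpers for illustration only, not Lucene's DirectWriter API, and the sample values are made up.

```python
def bits_required(values):
    """Bits per value needed to store every value of a block at one fixed width."""
    return max(1, max(values).bit_length())


def packed_size_bits(values, block_size):
    """Total payload bits when each block is packed at its own fixed width.

    Assumption: no per-block metadata is counted, only the packed values.
    """
    total = 0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        total += bits_required(block) * len(block)
    return total


# 16384 small values that fit in 4 bits each, plus a single large outlier
values = [i % 16 for i in range(16384)]
values[5000] = 1 << 40  # the outlier needs 41 bits

large = packed_size_bits(values, 16384)  # one 16k block
small = packed_size_bits(values, 128)    # 128-value blocks

print("one 16k block:   ", large, "bits")
print("128-value blocks:", small, "bits")
```

With one large block, the outlier forces all 16384 values to 41 bits. With 128-value blocks, only the block containing the outlier pays that cost and the other blocks stay at 4 bits per value.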



