[jira] [Updated] (LUCENE-5879) Add auto-prefix terms to block tree terms dict

Michael McCandless (JIRA) Mon, 25 Aug 2014 14:45:19 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael McCandless updated LUCENE-5879:
---------------------------------------

    Attachment: LUCENE-5879.patch

New patch, resolving a few nocommits (but more remain!).  Tests pass.

I created a simple benchmark, generating random longs over the full
range, and random ranges as queries.  For each test, I build the index
once (10M longs), and then run all queries (10K random ranges) 10
times and record best time.  I also print out the number of terms
visited for the first query as a coarse measure of the prefix terms
density.

NF is numeric field and AP is auto-prefix. For AP I just index the
long values as 8 byte binary term, using the same logic in
NumericUtils.longToPrefixCodedBytes to make the binary sort the same
as the numeric sort.

Source code for benchmark is here:

  
https://code.google.com/a/apache-extras.org/p/luceneutil/source/browse/src/python/autoPrefixPerf.py

And here:

  
https://code.google.com/a/apache-extras.org/p/luceneutil/source/browse/src/main/perf/AutoPrefixPerf.java

Indexer uses single thread and SerialMergeScheduler to try to measure
the added indexing cost terms writing.

Currently the auto-prefix terms are configured like term blocks in
block tree, with a min/max number of items (terms, or "child"
auto-prefix terms) per auto-prefix term.

Here are the raw results ... net/net it looks like AP can match NF's
performance with a ~50% smaller index and slightly faster indexing
time.  The two approaches are quite different: NF uses TermsEnum.seek,
but AP uses Terms.intersect which for block tree does no seeking.

However, there's a non-trivial indexing time & space cost vs the
baseline... not sure we can/should just always enable this by default
for DOCS_ONLY fields ...

{noformat}
Baseline (8-byte binary terms like AP)
    index sec 20.312 sec
    index MB 123.4

NF precStep=4
    index sec 225.06
    index MB 1275.70
    term count 560
    search msec 6056
NF precStep=8
    index sec 115.44
    index MB 675.55
    term count 2983
    search msec 5547
NF precStep=12
    index sec 80.77
    index MB 470.96
    term count 37405
    search msec 6080
NF precStep=16 (default)
    index sec 61.13
    index MB 363.19
    term count 125906
    search msec 10466
AP min=5 max=8
    index sec 194.21
    index MB 272.06
    term count 315
    search msec 5715
AP min=5 max=12
    index sec 179.86
    index MB 256.88
    term count 295
    search msec 5771
AP min=5 max=16
    index sec 168.91
    index MB 254.32
    term count 310
    search msec 5727
AP min=5 max=20
    index sec 157.48
    index MB 252.04
    term count 321
    search msec 5742
AP min=5 max=2147483647
    index sec 64.03
    index MB 164.55
    term count 3955
    search msec 6168
AP min=10 max=18
    index sec 106.52
    index MB 215.26
    term count 552
    search msec 5792
AP min=10 max=27
    index sec 99.00
    index MB 212.45
    term count 533
    search msec 5814
AP min=10 max=36
    index sec 88.45
    index MB 207.43
    term count 505
    search msec 5850
AP min=10 max=45
    index sec 79.15
    index MB 194.73
    term count 650
    search msec 5681
AP min=10 max=2147483647
    index sec 42.68
    index MB 162.64
    term count 6077
    search msec 6199
AP min=15 max=28
    index sec 84.83
    index MB 204.29
    term count 641
    search msec 5763
AP min=15 max=42
    index sec 74.20
    index MB 193.24
    term count 753
    search msec 5828
AP min=15 max=56
    index sec 63.69
    index MB 190.06
    term count 662
    search msec 5839
AP min=15 max=70
    index sec 62.53
    index MB 185.96
    term count 866
    search msec 5846
AP min=15 max=2147483647
    index sec 40.94
    index MB 162.52
    term count 6258
    search msec 6156
AP min=20 max=38
    index sec 69.26
    index MB 192.11
    term count 839
    search msec 5837
AP min=20 max=57
    index sec 60.75
    index MB 186.18
    term count 1034
    search msec 5877
AP min=20 max=76
    index sec 60.69
    index MB 185.21
    term count 980
    search msec 5866
AP min=20 max=95
    index sec 59.87
    index MB 184.20
    term count 985
    search msec 5940
AP min=20 max=2147483647
    index sec 41.64
    index MB 162.52
    term count 6258
    search msec 6196
AP min=25 max=48
    index sec 61.81
    index MB 187.08
    term count 806
    search msec 5790
AP min=25 max=72
    index sec 58.69
    index MB 183.02
    term count 929
    search msec 5894
AP min=25 max=96
    index sec 56.80
    index MB 178.29
    term count 841
    search msec 5938
AP min=25 max=120
    index sec 55.81
    index MB 177.75
    term count 1044
    search msec 5883
AP min=25 max=2147483647
    index sec 40.99
    index MB 162.52
    term count 6258
    search msec 6189
AP min=30 max=58
    index sec 56.99
    index MB 182.91
    term count 1012
    search msec 5891
AP min=30 max=87
    index sec 55.22
    index MB 178.16
    term count 1065
    search msec 5921
AP min=30 max=116
    index sec 54.78
    index MB 177.39
    term count 1142
    search msec 5811
AP min=30 max=145
    index sec 54.22
    index MB 176.64
    term count 1340
    search msec 5798
AP min=30 max=2147483647
    index sec 40.76
    index MB 162.44
    term count 6445
    search msec 6191
AP min=35 max=68
    index sec 56.75
    index MB 182.45
    term count 1265
    search msec 5403
AP min=35 max=102
    index sec 53.87
    index MB 177.99
    term count 1302
    search msec 5924
AP min=35 max=136
    index sec 51.10
    index MB 177.25
    term count 1547
    search msec 6082
AP min=35 max=170
    index sec 50.77
    index MB 176.93
    term count 1402
    search msec 5994
AP min=35 max=2147483647
    index sec 40.73
    index MB 161.75
    term count 8045
    search msec 6271
AP min=40 max=78
    index sec 55.75
    index MB 185.57
    term count 1499
    search msec 5917
AP min=40 max=117
    index sec 53.12
    index MB 180.57
    term count 1501
    search msec 5961
AP min=40 max=156
    index sec 50.93
    index MB 179.38
    term count 1806
    search msec 5487
AP min=40 max=195
    index sec 50.31
    index MB 178.81
    term count 1763
    search msec 5484
AP min=40 max=2147483647
    index sec 38.59
    index MB 158.98
    term count 12573
    search msec 6281
{noformat}


> Add auto-prefix terms to block tree terms dict
> ----------------------------------------------
>
>                 Key: LUCENE-5879
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5879
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/codecs
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.0, 4.10
>
>         Attachments: LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, 
> LUCENE-5879.patch
>
>
> This cool idea to generalize numeric/trie fields came from Adrien:
> Today, when we index a numeric field (LongField, etc.) we pre-compute
> (via NumericTokenStream) outside of indexer/codec which prefix terms
> should be indexed.
> But this can be inefficient: you set a static precisionStep, and
> always add those prefix terms regardless of how the terms in the field
> are actually distributed.  Yet typically in real world applications
> the terms have a non-random distribution.
> So, it should be better if instead the terms dict decides where it
> makes sense to insert prefix terms, based on how dense the terms are
> in each region of term space.
> This way we can speed up query time for both term (e.g. infix
> suggester) and numeric ranges, and it should let us use less index
> space and get faster range queries.
>  
> This would also mean that min/maxTerm for a numeric field would now be
> correct, vs today where the externally computed prefix terms are
> placed after the full precision terms, causing hairy code like
> NumericUtils.getMaxInt/Long.  So optos like LUCENE-5860 become
> feasible.
> The terms dict can also do tricks not possible if you must live on top
> of its APIs, e.g. to handle the adversary/over-constrained case when a
> given prefix has too many terms following it but finer prefixes
> have too few (what block tree calls "floor term blocks").



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Updated] (LUCENE-5879) Add auto-prefix terms to block tree terms dict

Reply via email to