[ https://issues.apache.org/jira/browse/LUCENE-5879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-5879: --------------------------------------- Attachment: LUCENE-5879.patch New patch, resolving a few nocommits (but more remain!). Tests pass. I created a simple benchmark, generating random longs over the full range, and random ranges as queries. For each test, I build the index once (10M longs), and then run all queries (10K random ranges) 10 times and record best time. I also print out the number of terms visited for the first query as a coarse measure of the prefix terms density. NF is numeric field and AP is auto-prefix. For AP I just index the long values as 8 byte binary term, using the same logic in NumericUtils.longToPrefixCodedBytes to make the binary sort the same as the numeric sort. Source code for benchmark is here: https://code.google.com/a/apache-extras.org/p/luceneutil/source/browse/src/python/autoPrefixPerf.py And here: https://code.google.com/a/apache-extras.org/p/luceneutil/source/browse/src/main/perf/AutoPrefixPerf.java Indexer uses single thread and SerialMergeScheduler to try to measure the added indexing cost terms writing. Currently the auto-prefix terms are configured like term blocks in block tree, with a min/max number of items (terms, or "child" auto-prefix terms) per auto-prefix term. Here are the raw results ... net/net it looks like AP can match NF's performance with a ~50% smaller index and slightly faster indexing time. The two approaches are quite different: NF uses TermsEnum.seek, but AP uses Terms.intersect which for block tree does no seeking. However, there's a non-trivial indexing time & space cost vs the baseline... not sure we can/should just always enable this by default for DOCS_ONLY fields ... {noformat} Baseline (8-byte binary terms like AP) index sec 20.312 sec index MB 123.4 NF precStep=4 index sec 225.06 index MB 1275.70 term count 560 search msec 6056 NF precStep=8 index sec 115.44 index MB 675.55 term count 2983 search msec 5547 NF precStep=12 index sec 80.77 index MB 470.96 term count 37405 search msec 6080 NF precStep=16 (default) index sec 61.13 index MB 363.19 term count 125906 search msec 10466 AP min=5 max=8 index sec 194.21 index MB 272.06 term count 315 search msec 5715 AP min=5 max=12 index sec 179.86 index MB 256.88 term count 295 search msec 5771 AP min=5 max=16 index sec 168.91 index MB 254.32 term count 310 search msec 5727 AP min=5 max=20 index sec 157.48 index MB 252.04 term count 321 search msec 5742 AP min=5 max=2147483647 index sec 64.03 index MB 164.55 term count 3955 search msec 6168 AP min=10 max=18 index sec 106.52 index MB 215.26 term count 552 search msec 5792 AP min=10 max=27 index sec 99.00 index MB 212.45 term count 533 search msec 5814 AP min=10 max=36 index sec 88.45 index MB 207.43 term count 505 search msec 5850 AP min=10 max=45 index sec 79.15 index MB 194.73 term count 650 search msec 5681 AP min=10 max=2147483647 index sec 42.68 index MB 162.64 term count 6077 search msec 6199 AP min=15 max=28 index sec 84.83 index MB 204.29 term count 641 search msec 5763 AP min=15 max=42 index sec 74.20 index MB 193.24 term count 753 search msec 5828 AP min=15 max=56 index sec 63.69 index MB 190.06 term count 662 search msec 5839 AP min=15 max=70 index sec 62.53 index MB 185.96 term count 866 search msec 5846 AP min=15 max=2147483647 index sec 40.94 index MB 162.52 term count 6258 search msec 6156 AP min=20 max=38 index sec 69.26 index MB 192.11 term count 839 search msec 5837 AP min=20 max=57 index sec 60.75 index MB 186.18 term count 1034 search msec 5877 AP min=20 max=76 index sec 60.69 index MB 185.21 term count 980 search msec 5866 AP min=20 max=95 index sec 59.87 index MB 184.20 term count 985 search msec 5940 AP min=20 max=2147483647 index sec 41.64 index MB 162.52 term count 6258 search msec 6196 AP min=25 max=48 index sec 61.81 index MB 187.08 term count 806 search msec 5790 AP min=25 max=72 index sec 58.69 index MB 183.02 term count 929 search msec 5894 AP min=25 max=96 index sec 56.80 index MB 178.29 term count 841 search msec 5938 AP min=25 max=120 index sec 55.81 index MB 177.75 term count 1044 search msec 5883 AP min=25 max=2147483647 index sec 40.99 index MB 162.52 term count 6258 search msec 6189 AP min=30 max=58 index sec 56.99 index MB 182.91 term count 1012 search msec 5891 AP min=30 max=87 index sec 55.22 index MB 178.16 term count 1065 search msec 5921 AP min=30 max=116 index sec 54.78 index MB 177.39 term count 1142 search msec 5811 AP min=30 max=145 index sec 54.22 index MB 176.64 term count 1340 search msec 5798 AP min=30 max=2147483647 index sec 40.76 index MB 162.44 term count 6445 search msec 6191 AP min=35 max=68 index sec 56.75 index MB 182.45 term count 1265 search msec 5403 AP min=35 max=102 index sec 53.87 index MB 177.99 term count 1302 search msec 5924 AP min=35 max=136 index sec 51.10 index MB 177.25 term count 1547 search msec 6082 AP min=35 max=170 index sec 50.77 index MB 176.93 term count 1402 search msec 5994 AP min=35 max=2147483647 index sec 40.73 index MB 161.75 term count 8045 search msec 6271 AP min=40 max=78 index sec 55.75 index MB 185.57 term count 1499 search msec 5917 AP min=40 max=117 index sec 53.12 index MB 180.57 term count 1501 search msec 5961 AP min=40 max=156 index sec 50.93 index MB 179.38 term count 1806 search msec 5487 AP min=40 max=195 index sec 50.31 index MB 178.81 term count 1763 search msec 5484 AP min=40 max=2147483647 index sec 38.59 index MB 158.98 term count 12573 search msec 6281 {noformat} > Add auto-prefix terms to block tree terms dict > ---------------------------------------------- > > Key: LUCENE-5879 > URL: https://issues.apache.org/jira/browse/LUCENE-5879 > Project: Lucene - Core > Issue Type: New Feature > Components: core/codecs > Reporter: Michael McCandless > Assignee: Michael McCandless > Fix For: 5.0, 4.10 > > Attachments: LUCENE-5879.patch, LUCENE-5879.patch, LUCENE-5879.patch, > LUCENE-5879.patch > > > This cool idea to generalize numeric/trie fields came from Adrien: > Today, when we index a numeric field (LongField, etc.) we pre-compute > (via NumericTokenStream) outside of indexer/codec which prefix terms > should be indexed. > But this can be inefficient: you set a static precisionStep, and > always add those prefix terms regardless of how the terms in the field > are actually distributed. Yet typically in real world applications > the terms have a non-random distribution. > So, it should be better if instead the terms dict decides where it > makes sense to insert prefix terms, based on how dense the terms are > in each region of term space. > This way we can speed up query time for both term (e.g. infix > suggester) and numeric ranges, and it should let us use less index > space and get faster range queries. > > This would also mean that min/maxTerm for a numeric field would now be > correct, vs today where the externally computed prefix terms are > placed after the full precision terms, causing hairy code like > NumericUtils.getMaxInt/Long. So optos like LUCENE-5860 become > feasible. > The terms dict can also do tricks not possible if you must live on top > of its APIs, e.g. to handle the adversary/over-constrained case when a > given prefix has too many terms following it but finer prefixes > have too few (what block tree calls "floor term blocks"). -- This message was sent by Atlassian JIRA (v6.2#6252) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org