[
https://issues.apache.org/jira/browse/LUCENE-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033650#comment-16033650
]
Mikhail Khludnev commented on LUCENE-7863:
------------------------------------------
Let's index six one-word docs:
|foo|
|foo|
|foo|
|bar|
|bar|
|bar|
h3. Index with ReversedWildcardFilter
|term|posting offset (relative)|
|1oof|0|
|1rab|3|
|bar|3|
|foo|3|
|Postings (absolute values)|
|0,1,2|
|3,4,5|
|3,4,5|
|0,1,2|
Here you see that postings (and positions) are duplicated for every derived
term.
h2. Proposal - DRY
|term|posting offset (relative)|
|1oof|0|
|1rab|3|
|bar|-3|
|foo|-3|
|Postings (absolute values)|
|0,1,2|
|3,4,5|
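The tables above can be illustrated with a toy in-memory sketch. This is not the patch and not Lucene's codec API: {{SharedPostingsSketch}}, {{add}}, {{postings}} and {{build}} are hypothetical names, and the {{1}} marker is simply copied from the tables. It only shows the idea that a term and its reversed twin can point at one stored postings list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model only: a sorted term dictionary whose entries point at postings slots.
final class SharedPostingsSketch {
    // term -> index of its postings list in the store (hypothetical layout)
    final TreeMap<String, Integer> terms = new TreeMap<>();
    // each postings list is stored exactly once
    final List<int[]> store = new ArrayList<>();

    // Register a term and its reversed twin; both share one postings slot.
    void add(String term, int[] docs) {
        store.add(docs);
        int slot = store.size() - 1;
        terms.put(term, slot);
        terms.put("1" + new StringBuilder(term).reverse(), slot);
    }

    // Look up the postings for a term, or an empty list if the term is unknown.
    int[] postings(String term) {
        Integer slot = terms.get(term);
        return slot == null ? new int[0] : store.get(slot);
    }

    // The six one-word docs from the example: three "foo", three "bar".
    static SharedPostingsSketch build() {
        SharedPostingsSketch idx = new SharedPostingsSketch();
        idx.add("foo", new int[] {0, 1, 2});
        idx.add("bar", new int[] {3, 4, 5});
        return idx;
    }

    public static void main(String[] args) {
        SharedPostingsSketch idx = build();
        System.out.println(idx.terms.keySet() + " -> " + idx.store.size() + " postings lists");
    }
}
```

Four terms end up in the dictionary, but only two postings lists are stored.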
h2. Note
This looks really challenging to implement. Since codecs don't allow this kind
of tweaking, I had to change {{o.a.l.i}} classes. The code introduces a
relation between terms, see {{FreqProxTermsEnum.getTwinTerm()}} and so on
(it's one of the ugliest pieces). It also requires changing the term block
format: posting offsets are written as ZLong (instead of VLong), since they
need to be able to go negative. I'm afraid it breaks a lot of tests, since I
was interested in only one, {{TestReversedWildcardFilterFactory}}. It passes. I
also experimented with 5M enwiki and it seems to roughly work: RWF blows the
index up from 13G to 28G, while this code keeps it at 17G and runs \*leading
queries fast.
It targets only {{RWF}}, where a derived term maps 1-1 to its origin term. This
patch is for branch_6x.
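To illustrate why a negative offset calls for ZLong, here is a minimal standalone sketch of zig-zag encoding, the general technique behind that name. It is not the patch code and does not call Lucene; {{ZigZagSketch}} and its methods are hypothetical names, and {{varLenBytes}} just counts the bytes of a generic 7-bits-per-byte encoding.

```java
// Standalone sketch: zig-zag maps signed values to unsigned ones so that
// small negative numbers still fit in few bytes of a variable-length encoding.
final class ZigZagSketch {
    // 0, -1, 1, -2, 2, ... become 0, 1, 2, 3, 4, ...
    static long zigZagEncode(long v) {
        return (v >> 63) ^ (v << 1);
    }

    // Inverse of zigZagEncode.
    static long zigZagDecode(long v) {
        return (v >>> 1) ^ -(v & 1);
    }

    // Bytes needed by a 7-bits-per-byte encoding, treating v as unsigned.
    static int varLenBytes(long v) {
        int n = 1;
        while ((v & ~0x7FL) != 0) {
            v >>>= 7;
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        long delta = -3; // a negative relative posting offset, as in the table
        long encoded = zigZagEncode(delta);
        System.out.println("zig-zag(" + delta + ") = " + encoded
            + ", bytes = " + varLenBytes(encoded)
            + ", raw two's-complement bytes = " + varLenBytes(delta));
    }
}
```

With zig-zag, -3 becomes 5 and fits in one byte, whereas the raw two's-complement bits of -3, read as an unsigned value, would take ten bytes in the same scheme.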
h2. Disclaimer
The current patch is quick and dirty ({{trickedFields = Arrays.asList("one",
"body_txt_en")}}, and plenty of {{sysout}}); I've only sketched the idea.
h2. TODO
- How to carry the relation between origin and derived NGram terms (1 - Many)?
- How to adjust the current {{o.a.l.i}} to bring deduplicated postings to the
codec?
h2. The next idea
For \*infix\* searches we need to derive the following terms (for three
{{bar}} docs and three {{baz}} docs):
|term|position offset|
|ar_bar|0|
|az_baz|3|
|bar|-3|
|baz|3|
|r_bar|-3|
|z_baz|3|
Here we should write each of the two postings lists only once. Then a
{{\*a\*}} query can find both postings lists with a prefix query {{a\*}}.
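A rough sketch of this term derivation and the prefix lookup, using a plain sorted map rather than anything from Lucene. {{InfixTermsSketch}}, {{derive}}, {{buildDict}} and {{infixSlots}} are hypothetical names, and the {{suffix_term}} layout is inferred from the table above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model: every proper suffix of a term is indexed as "suffix_term" and
// points at the same postings slot as the origin term.
final class InfixTermsSketch {
    // "bar" -> [bar, ar_bar, r_bar]
    static List<String> derive(String term) {
        List<String> out = new ArrayList<>();
        out.add(term);
        for (int i = 1; i < term.length(); i++) {
            out.add(term.substring(i) + "_" + term);
        }
        return out;
    }

    // Sorted dictionary: derived term -> slot of the origin term's postings.
    static TreeMap<String, Integer> buildDict(String... terms) {
        TreeMap<String, Integer> dict = new TreeMap<>();
        for (int slot = 0; slot < terms.length; slot++) {
            for (String derived : derive(terms[slot])) {
                dict.put(derived, slot);
            }
        }
        return dict;
    }

    // An *infix* query becomes a prefix scan over the sorted dictionary;
    // the result is the set of distinct postings slots to read.
    static Set<Integer> infixSlots(TreeMap<String, Integer> dict, String infix) {
        return new TreeSet<>(dict.subMap(infix, infix + '\uffff').values());
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> dict = buildDict("bar", "baz");
        System.out.println(dict.keySet());
        System.out.println(infixSlots(dict, "a"));
    }
}
```

The dictionary keys come out in the same order as the table, and the scan for {{a}} touches both postings slots without any list being stored twice.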
> Don't repeat postings and positions on ReverseWF, EdgeNGram, etc
> ------------------------------------------------------------------
>
> Key: LUCENE-7863
> URL: https://issues.apache.org/jira/browse/LUCENE-7863
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Mikhail Khludnev
>
> h2. Context
> \*suffix and \*infix\* searches on large indexes.
> h2. Problem
> Obviously applying {{ReversedWildcardFilter}} doubles an index size, and I'm
> shuddering to think about EdgeNGrams...
> h2. Proposal
> _DRY_