[jira] [Updated] (LUCENE-5029) factor out a generic 'TermState' for better sharing in FST-based term dict

Han Jiang (JIRA) Sun, 16 Jun 2013 03:03:24 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Han Jiang updated LUCENE-5029:
------------------------------

    Attachment: LUCENE-5029.patch

This patch keeps the original 'customize termstate in PBF' design. 
It also moves flushTermsBlock & readTermsBlock to the term dict side.

Now the rule is: if your PBF has some monotonic but 'don't care' values,
always fill them with -1, so that the term dict will reuse the previous
values to 'pad' those -1s. Yes Mike, the algebra is really simple :)
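
A minimal sketch of the padding rule, assuming the monotonic metadata is held in a plain long[] per term. The class and method names (MetadataPadding, pad) are hypothetical and are not taken from the patch:

```java
import java.util.Arrays;

// Hypothetical illustration of the "-1 means don't care" rule; not code from the patch.
final class MetadataPadding {

    /**
     * Returns a copy of 'current' in which every -1 slot is replaced by the
     * value at the same index in 'previous', so each slot stays monotonic.
     */
    static long[] pad(long[] previous, long[] current) {
        if (previous.length != current.length) {
            throw new IllegalArgumentException("metadata arrays must have the same length");
        }
        long[] padded = new long[current.length];
        for (int i = 0; i < current.length; i++) {
            padded[i] = (current[i] == -1) ? previous[i] : current[i];
        }
        return padded;
    }

    public static void main(String[] args) {
        long[] previous = {100, 200, 300};
        long[] current = {150, -1, 320}; // the PBF does not care about slot 1
        System.out.println(Arrays.toString(pad(previous, current)));
    }
}
```

With this rule, a 'don't care' slot never breaks the ordering between two consecutive terms, because it simply repeats the previous term's value.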

But I still have a problem removing termBlockOrd from BlockTermState:
every time a caller uses seekExact(), it expects to get a new term
state in which 'termBlockOrd' is involved. However, I don't fully 
understand how this variable works; maybe we can use metadataUpto
to replace it? I'll try this later.

Can you put the TestDrillSideway fix in the lucene3069 branch as well? 
Thanks :)

                
> factor out a generic 'TermState' for better sharing in FST-based term dict
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-5029
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5029
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Han Jiang
>            Assignee: Han Jiang
>            Priority: Minor
>             Fix For: 4.4
>
>         Attachments: LUCENE-5029.algebra.patch, LUCENE-5029.algebra.patch, 
> LUCENE-5029.branch-init.patch, LUCENE-5029.patch, LUCENE-5029.patch, 
> LUCENE-5029.patch, LUCENE-5029.patch, LUCENE-5029.patch
>
>
> Currently, the two FST-based term dicts (memory codec & blocktree) both use 
> FST<BytesRef> as their base data structure. This might not share much data in 
> parent arcs, since the encoded BytesRef doesn't guarantee that 
> 'Outputs.common()' always creates a long prefix. 
> For the current postings format, however, it is guaranteed that each FP 
> (pointing to .doc, .pos, etc.) will increase monotonically with 'larger' 
> terms. That means that, between two Outputs, the Outputs from the smaller term 
> can be safely pushed towards the root. However, we always have some tricky 
> TermState to deal with (like the singletonDocID for the pulsing trick), so, as 
> Mike suggested, we can simply cut the whole TermState into two parts: one part 
> for comparison and intersection, another for restoring generic data. Then the 
> data structure will be clear: this generic 'TermState' will consist of a 
> fixed-length LongsRef and a variable-length BytesRef. 
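
The two-part layout described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses plain long[] and byte[] in place of Lucene's LongsRef and BytesRef to stay self-contained, and the class name GenericTermStateSketch and the helpers common() and subtract() are hypothetical. Taking the element-wise minimum as the shared prefix is an assumption made for the example, not necessarily what the patch does:

```java
import java.util.Arrays;

// Hypothetical sketch of a generic term state: a fixed-length monotonic part
// plus a variable-length opaque part. Not the implementation from the patch.
final class GenericTermStateSketch {
    final long[] longs; // fixed-length part, e.g. file pointers; used for comparison and sharing
    final byte[] bytes; // variable-length part, e.g. a singleton docID; restored as-is

    GenericTermStateSketch(long[] longs, byte[] bytes) {
        this.longs = longs.clone();
        this.bytes = bytes.clone();
    }

    /** Shared prefix of two long outputs: the element-wise minimum (assumption). */
    static long[] common(long[] a, long[] b) {
        checkSameLength(a, b);
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = Math.min(a[i], b[i]);
        }
        return out;
    }

    /** Remainder left on the child arc after 'prefix' has been pushed towards the root. */
    static long[] subtract(long[] value, long[] prefix) {
        checkSameLength(value, prefix);
        long[] out = new long[value.length];
        for (int i = 0; i < value.length; i++) {
            out[i] = value[i] - prefix[i];
        }
        return out;
    }

    private static void checkSameLength(long[] a, long[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("long parts must have the same fixed length");
        }
    }

    public static void main(String[] args) {
        long[] smallerTerm = {10, 4}; // e.g. pointers into .doc and .pos
        long[] largerTerm = {25, 9};
        long[] shared = common(smallerTerm, largerTerm);
        System.out.println(Arrays.toString(shared));
        System.out.println(Arrays.toString(subtract(largerTerm, shared)));
    }
}
```

Because every slot grows monotonically with 'larger' terms, the shared part here equals the smaller term's values, which is why the whole output of the smaller term can move towards the root.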

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

