[jira] [Updated] (LUCENE-5029) factor out a generic 'TermState' for better sharing in FST-based term dict

Han Jiang (JIRA) Sat, 15 Jun 2013 02:54:24 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-5029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Han Jiang updated LUCENE-5029:
------------------------------

    Attachment: LUCENE-5029.patch
                LUCENE-5029.algebra.patch

Updated the reader part. Now we can safely remove termBlockOrd in
BlockTermState, which means the API is OK for non-block-based term dicts. As
for the FST-based term dict, the remaining job is to bring back
TermStateOutputs.

The patch is still against trunk, but strangely it fails on this single test:

{code}
ant test -Dtestcase=TestDrillSideways -Dtests.method=testRandom -Dtests.seed=7FEAE9B6DF414156 -Dtests.slow=true -Dtests.postingsformat=TempBlock -Dtests.locale=ar_KW -Dtests.timezone=America/Indiana/Winamac -Dtests.file.encoding=US-ASCII
{code}

But I suppose it is unrelated?
                
> factor out a generic 'TermState' for better sharing in FST-based term dict
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-5029
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5029
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Han Jiang
>            Assignee: Han Jiang
>            Priority: Minor
>             Fix For: 4.4
>
>         Attachments: LUCENE-5029.algebra.patch, LUCENE-5029.algebra.patch, 
> LUCENE-5029.patch, LUCENE-5029.patch, LUCENE-5029.patch, LUCENE-5029.patch
>
>
> Currently, the two FST-based term dicts (memory codec & blocktree) both use
> FST<BytesRef> as the base data structure. This might not share much data in
> parent arcs, since the encoded BytesRef doesn't guarantee that
> 'Outputs.common()' always creates a long prefix.
> For the current postings format, however, it is guaranteed that each FP
> (pointing to .doc, .pos, etc.) increases monotonically with 'larger' terms.
> That means, between two Outputs, the Outputs from the smaller term can be
> safely pushed towards the root. However, we always have some tricky TermState
> to deal with (like the singletonDocID for the pulsing trick), so as Mike
> suggested, we can simply cut the whole TermState into two parts: one part for
> comparison and intersection, another for restoring generic data. Then the
> data structure will be clear: this generic 'TermState' will consist of a
> fixed-length LongsRef and a variable-length BytesRef.
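
The split described above can be illustrated with a minimal, self-contained
sketch. It does not use the Lucene API: the class TermStateSketch, its State
type, and the common/subtract/add helpers are hypothetical names chosen for
illustration, and taking the element-wise minimum as the shared part is an
assumption about how monotonic file pointers could be pushed towards the root,
not the logic of the attached patch.

```java
import java.util.Arrays;

/** Hypothetical sketch (not the Lucene API) of a two-part term state. */
class TermStateSketch {

    /** Fixed-length monotonic longs plus opaque variable-length bytes. */
    static final class State {
        final long[] longs; // e.g. file pointers into .doc, .pos; comparable and shareable
        final byte[] bytes; // generic data, e.g. a singleton doc ID; stored as-is

        State(long[] longs, byte[] bytes) {
            this.longs = longs;
            this.bytes = bytes;
        }
    }

    /** Shared part of two long arrays of equal length: element-wise minimum (an assumption). */
    static long[] common(long[] a, long[] b) {
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = Math.min(a[i], b[i]);
        }
        return out;
    }

    /** What is left on the child arc once the shared part has moved towards the root. */
    static long[] subtract(long[] full, long[] shared) {
        long[] out = new long[full.length];
        for (int i = 0; i < full.length; i++) {
            out[i] = full[i] - shared[i];
        }
        return out;
    }

    /** Rebuilds the full values while walking from the root down to the term. */
    static long[] add(long[] shared, long[] rest) {
        long[] out = new long[shared.length];
        for (int i = 0; i < shared.length; i++) {
            out[i] = shared[i] + rest[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up file pointers for a smaller and a larger term.
        State small = new State(new long[] {100L, 40L}, new byte[] {7});
        State large = new State(new long[] {180L, 75L}, new byte[0]);

        long[] shared = common(small.longs, large.longs);
        long[] rest = subtract(large.longs, shared);

        System.out.println(Arrays.toString(shared));            // [100, 40]
        System.out.println(Arrays.toString(rest));              // [80, 35]
        System.out.println(Arrays.toString(add(shared, rest))); // [180, 75]
    }
}
```

Because the longs grow monotonically with larger terms, the smaller term's
values are always the shared part here, while the byte part is never compared
or merged and simply stays attached to its own term.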
