[jira] [Updated] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

Han Jiang (JIRA) Tue, 23 Jul 2013 02:51:06 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Han Jiang updated LUCENE-3069:
------------------------------

    Attachment: LUCENE-3069.patch

Uploaded patch: implemented IntersectEnum.next() & seekCeil().
There are still lots of nocommits, but it passes all tests.

The main idea is to run a DFS on the FST and backtrack as early as
possible (i.e. as soon as we see that a label is rejected by the automaton).
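
To illustrate the idea only (this is not the code in the patch), here is a
minimal sketch in which a plain trie stands in for the FST and a hand-written
DFA stands in for the compiled automaton. Node, Dfa, buildTrie and prefixCa
are hypothetical names for this example, not Lucene API, and state 0 is
assumed to be the DFA's start state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy stand-ins: a trie plays the role of the FST, a small DFA plays the
// role of the automaton. None of these types are Lucene classes.
final class DfsIntersectSketch {
    static final class Node {
        final TreeMap<Character, Node> arcs = new TreeMap<>(); // sorted by label
        boolean isFinal;
    }

    interface Dfa {
        int step(int state, char label); // returns -1 when the label is rejected
        boolean isAccept(int state);
    }

    static Node buildTrie(String... terms) {
        Node root = new Node();
        for (String t : terms) {
            Node n = root;
            for (char c : t.toCharArray()) {
                n = n.arcs.computeIfAbsent(c, k -> new Node());
            }
            n.isFinal = true;
        }
        return root;
    }

    // Collects, in sorted order, every term in the trie that the DFA accepts.
    static List<String> intersect(Node root, Dfa dfa) {
        List<String> out = new ArrayList<>();
        dfs(root, 0, dfa, new StringBuilder(), out);
        return out;
    }

    private static void dfs(Node node, int state, Dfa dfa,
                            StringBuilder prefix, List<String> out) {
        if (node.isFinal && dfa.isAccept(state)) {
            out.add(prefix.toString());
        }
        for (Map.Entry<Character, Node> e : node.arcs.entrySet()) {
            int next = dfa.step(state, e.getKey());
            if (next < 0) {
                continue; // label rejected: backtrack without visiting the subtree
            }
            prefix.append(e.getKey().charValue());
            dfs(e.getValue(), next, dfa, prefix, out);
            prefix.setLength(prefix.length() - 1);
        }
    }

    // Example DFA: accepts "ca" followed by anything.
    static Dfa prefixCa() {
        return new Dfa() {
            public int step(int state, char label) {
                if (state == 0) return label == 'c' ? 1 : -1;
                if (state == 1) return label == 'a' ? 2 : -1;
                return 2; // state 2 loops on every label
            }
            public boolean isAccept(int state) {
                return state == 2;
            }
        };
    }
}
```

With the terms "cat", "car", "dog", "ca" and "cow", the subtrees under 'd'
and under "co" are never entered, because their first label is rejected.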

For this version, there is one explicit perf overhead: I use a 
real stack here, which can be replaced by a Frame[] to reuse objects.
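
A rough sketch of what the Frame[] variant could look like, again on a toy
trie rather than the real FST. The Frame fields, the maxDepth parameter and
all other names here are assumptions made for this example; the patch's own
Frame class may look different. The frames are allocated once and then
overwritten on every push, so the traversal itself creates no stack objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FrameStackSketch {
    static final class Node {
        char[] labels = new char[0];  // sorted arc labels
        Node[] targets = new Node[0]; // targets[i] is reached via labels[i]
        boolean isFinal;

        // Returns the child for label c, inserting it in sorted position if absent.
        Node child(char c) {
            int i = Arrays.binarySearch(labels, c);
            if (i >= 0) return targets[i];
            int at = -i - 1;
            labels = Arrays.copyOf(labels, labels.length + 1);
            targets = Arrays.copyOf(targets, targets.length + 1);
            System.arraycopy(labels, at, labels, at + 1, labels.length - at - 1);
            System.arraycopy(targets, at, targets, at + 1, targets.length - at - 1);
            labels[at] = c;
            targets[at] = new Node();
            return targets[at];
        }
    }

    static final class Frame {
        Node node;
        int state;  // automaton state on entering this node
        int arcIdx; // next arc to try at this node
    }

    interface Dfa {
        int step(int state, char label); // -1 means rejected
        boolean isAccept(int state);
    }

    static Node build(String... terms) {
        Node root = new Node();
        for (String t : terms) {
            Node n = root;
            for (char c : t.toCharArray()) n = n.child(c);
            n.isFinal = true;
        }
        return root;
    }

    // maxDepth must be at least the length of the longest term in the trie.
    static List<String> intersect(Node root, Dfa dfa, int maxDepth) {
        Frame[] stack = new Frame[maxDepth + 1];
        for (int i = 0; i < stack.length; i++) stack[i] = new Frame();
        char[] term = new char[maxDepth];
        List<String> out = new ArrayList<>();

        int top = 0;
        stack[0].node = root;
        stack[0].state = 0;
        stack[0].arcIdx = 0;
        if (root.isFinal && dfa.isAccept(0)) out.add("");

        while (top >= 0) {
            Frame f = stack[top];
            if (f.arcIdx == f.node.labels.length) { // all arcs tried: pop
                top--;
                continue;
            }
            char label = f.node.labels[f.arcIdx];
            Node target = f.node.targets[f.arcIdx];
            f.arcIdx++;
            int next = dfa.step(f.state, label);
            if (next < 0) continue; // rejected: skip this subtree
            term[top] = label;
            Frame g = stack[++top]; // reuse a preallocated frame
            g.node = target;
            g.state = next;
            g.arcIdx = 0;
            if (target.isFinal && dfa.isAccept(next)) out.add(new String(term, 0, top));
        }
        return out;
    }

    // Example DFA: accepts "ca" followed by anything.
    static Dfa prefixCa() {
        return new Dfa() {
            public int step(int state, char label) {
                if (state == 0) return label == 'c' ? 1 : -1;
                if (state == 1) return label == 'a' ? 2 : -1;
                return 2;
            }
            public boolean isAccept(int state) { return state == 2; }
        };
    }
}
```

A real next() would return at each match and resume from the saved frames on
the following call; the sketch collects all matches into a list to stay short.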

There are several aspects I haven't dug into deeply:

* currently, CompiledAutomaton provides a commonSuffixRef, but how
  can we make use of it in FST?
* the DFS is somewhat of a 'goto' version, i.e. we could make the code
  cleaner with a single while-loop, similar to a BFS search.
  However, since the FST doesn't always tell us how many arcs are leaving
  the current arc, this is hard to deal with...
* when the FST is large enough, the next() operation will take a long time
  doing the linear arc read; maybe we should make use of
  CompiledAutomaton.sortedTransition[] when there are many leaving arcs.
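
One possible shape for the last point, as a sketch only: instead of testing
every arc against the automaton, walk the state's sorted transitions and
binary-search the arc labels for the start of each range. It assumes that
the arc labels of a node are available as a sorted array and that the
transitions are sorted and non-overlapping; neither is confirmed above for
the real FST. Transition, lowerBound and matchingArcs are hypothetical
names, not Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

final class TransitionSeekSketch {
    // Accepts labels in [min, max] (inclusive) and leads to state `to`.
    static final class Transition {
        final int min, max, to;
        Transition(int min, int max, int to) {
            this.min = min;
            this.max = max;
            this.to = to;
        }
    }

    // Index of the first label >= key in labels[from..], or labels.length if none.
    static int lowerBound(int[] labels, int from, int key) {
        int lo = from, hi = labels.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (labels[mid] < key) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // Returns {arcIndex, nextState} for every arc accepted by some transition.
    // Arcs that fall between two ranges are skipped without being inspected.
    static List<int[]> matchingArcs(int[] sortedLabels, Transition[] sortedTransitions) {
        List<int[]> out = new ArrayList<>();
        int from = 0;
        for (Transition t : sortedTransitions) {
            int i = lowerBound(sortedLabels, from, t.min);
            while (i < sortedLabels.length && sortedLabels[i] <= t.max) {
                out.add(new int[] {i, t.to});
                i++;
            }
            from = i; // later transitions can only match later arcs
        }
        return out;
    }
}
```

Whether this beats the linear read depends on the ratio of arcs to
transitions, so it would only pay off on nodes with many leaving arcs.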

                
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 4.4
>
>         Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a 
> delta codec file for scanning to terms. Some environments have enough memory 
> available to keep the entire FST based term dict in memory. We should add a 
> TermDictionary implementation that encodes all needed information for each 
> term into the FST (custom fst.Output) and builds a FST from the entire term 
> not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

