[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

Han Jiang (JIRA) Tue, 23 Jul 2013 19:41:38 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717911#comment-13717911
 ]


Han Jiang commented on LUCENE-3069:
-----------------------------------

bq. You should not need to .getPosition / .setPosition on the fstReader:

Oh, yes! I'll fix.

bq. I think we can't really make use of it, which is fine (it's an optional 
optimization).

OK. Actually, I was quite curious why we don't make use of commonPrefixRef 
in CompiledAutomaton. Maybe we could determinize the input Automaton first, then
get commonPrefixRef via SpecialOperations? Is that too slow, or is the prefix
not always long enough to be worth it?
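
To make the question concrete, here is a minimal sketch of reading a common prefix off a deterministic automaton. It is a toy model, not Lucene's Automaton or SpecialOperations code: the class DfaCommonPrefix, its map-based transition table, and the assumption that every state can still reach an accepting state are all made up for illustration. The walk is only valid after determinizing, because in an NFA one state may have several arcs with the same label.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy DFA helper (hypothetical, not part of Lucene). */
final class DfaCommonPrefix {

    /**
     * Follows the single outgoing transition from each state, starting at
     * start, until it reaches an accepting state or a state that branches.
     * Assumes the DFA has no dead states.
     */
    static String commonPrefix(Map<Integer, Map<Character, Integer>> transitions,
                               Set<Integer> accepting, int start) {
        StringBuilder prefix = new StringBuilder();
        Set<Integer> visited = new HashSet<>(); // guards against cycles
        int state = start;
        while (!accepting.contains(state) && visited.add(state)) {
            Map<Character, Integer> out = transitions.get(state);
            if (out == null || out.size() != 1) {
                break; // no arcs, or the automaton branches here
            }
            Map.Entry<Character, Integer> only = out.entrySet().iterator().next();
            prefix.append(only.getKey().charValue());
            state = only.getValue();
        }
        return prefix.toString();
    }
}
```

The walk itself is linear in the prefix length, so any cost concern would come from the determinize step, not from this loop.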

bq. But this can only be done if that FST node's arcs are array'd right?

Yes, array arcs only, and we might need methods like advance(label) to do the 
search. Here a galloping search might work better than a traditional binary search.
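
As a rough illustration of that idea, here is a sketch of a galloping (exponential) search over a sorted array of arc labels. The class ArcGallop and this advance(labels, from, target) signature are hypothetical. They model one array'd FST node as a plain int[] and are not the FST API.

```java
/** Toy model of one FST node whose arcs are a sorted label array (hypothetical). */
final class ArcGallop {

    /**
     * Returns the index of the first label >= target at or after position
     * from, or labels.length if there is none. labels must be sorted ascending.
     */
    static int advance(int[] labels, int from, int target) {
        int n = labels.length;
        if (from >= n || labels[from] >= target) {
            return Math.min(from, n);
        }
        // Gallop: double the step until we overshoot target or leave the array.
        // Invariant: labels[lo] < target.
        int lo = from;
        int step = 1;
        while (lo + step < n && labels[lo + step] < target) {
            lo += step;
            step <<= 1;
        }
        int hi = Math.min(lo + step, n); // labels[hi] >= target, or hi == n
        // Binary search between lo and hi for the first label >= target.
        while (hi - lo > 1) {
            int mid = (lo + hi) >>> 1;
            if (labels[mid] < target) {
                lo = mid;
            } else {
                hi = mid;
            }
        }
        return hi;
    }
}
```

When the next label is usually close to the current position, as in an intersection that keeps moving forward through the arcs, the gallop phase ends after a few probes, whereas a plain binary search always pays for the full remaining range.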

{quote}
Separately, supporting ord w/ FST terms dict should in theory be not
so hard; you'd need to use getByOutput to seek by ord. Maybe (later,
eventually) we can make this a write-time option. We should open a
separate issue ...
{quote}

Ah, yes, but it seems that getByOutput doesn't rewind/reuse previous state?
We always have to start from the first arc on every seek. However, I'm 
not sure in what kinds of use cases we need the ord information.
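
For reference, the restart-from-the-root behaviour can be sketched with a toy structure. This is not Lucene's FST or Util.getByOutput. OrdTrie, add and getByOrd are hypothetical names, and the sketch assumes terms are added in strictly increasing order so that a term's ord is the sum of the outputs along its path.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy trie whose arc outputs sum to the term's ord (hypothetical, not Lucene's FST). */
final class OrdTrie {

    private static final class Node {
        final List<Arc> arcs = new ArrayList<>(); // sorted, since terms arrive sorted
        boolean isFinal;
    }

    private static final class Arc {
        final char label;
        final long output;
        final Node target = new Node();

        Arc(char label, long output) {
            this.label = label;
            this.output = output;
        }
    }

    private final Node root = new Node();
    private long nextOrd = 0;

    /** Adds a term; terms must be added in strictly increasing order. */
    void add(String term) {
        long ord = nextOrd++;
        long acc = 0; // sum of outputs on the path so far
        Node node = root;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            Arc last = node.arcs.isEmpty() ? null : node.arcs.get(node.arcs.size() - 1);
            if (last == null || last.label != c) {
                last = new Arc(c, ord - acc); // a new arc carries the missing part of ord
                node.arcs.add(last);
            }
            acc += last.output;
            node = last.target;
        }
        node.isFinal = true;
    }

    /** Returns the term with the given ord, or null. Every call starts at the root. */
    String getByOrd(long ord) {
        StringBuilder term = new StringBuilder();
        Node node = root;
        long remaining = ord;
        while (true) {
            if (node.isFinal && remaining == 0) {
                return term.toString();
            }
            Arc chosen = null;
            for (Arc arc : node.arcs) { // outputs increase with the label
                if (arc.output > remaining) {
                    break;
                }
                chosen = arc; // last arc whose output still fits
            }
            if (chosen == null) {
                return null;
            }
            term.append(chosen.label);
            remaining -= chosen.output;
            node = chosen.target;
        }
    }
}
```

In this sketch nothing is carried over between calls to getByOrd, which is the cost being discussed: seeking ord 3 right after ord 2 walks the shared prefix again.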


I'll commit the current version first, so we can iterate.
                
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 5.0, 4.5
>
>         Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch
>
>
> The FST-based TermDictionary has been a great improvement, yet it still uses a 
> delta codec file for scanning to terms. Some environments have enough memory 
> available to keep the entire FST-based term dict in memory. We should add a 
> TermDictionary implementation that encodes all needed information for each 
> term into the FST (custom fst.Output) and builds an FST from the entire term, 
> not just the delta.
