[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

Michael McCandless (JIRA) Tue, 16 Jul 2013 08:57:28 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13709867#comment-13709867
 ]


Michael McCandless commented on LUCENE-3069:
--------------------------------------------

bq. However, seekExact(BytesRef, TermsState) simply 'copy' the value of 
termState to enum, which doesn't actually operate 'seek' on dictionary.

This is normal / by design.  It's so that the case of seekExact(TermState) 
followed by .docs or .docsAndPositions is fast.  We only need to re-load the 
metadata if the caller then tries to do .next()

{quote}
bq. Maybe instead of term and meta members, we could just hold the current pair?

Oh, yes, I once thought about this, but not sure: like, can the callee always 
makes sure that,
when 'term()' is called, it will always return a valid term?
The codes in MemoryPF just return 'pair.output' regardless whether pair==null, 
is it safe?
{quote}

We can't guarantee that, but I think we can just check if pair == null and 
return null from term()?

{quote}
By the way, for real data, when two outputs are not 'NO_OUTPUT', even they 
contains the same metadata + stats, 
it seems to be very seldom that their arcs can be identical on FST (increases 
less than 1MB for wikimedium1m if 
equals always return false for non-singleton argument). Therefore... yes, 
hashCode() isn't necessary here.
{quote}
Hmm, but it seems like we should implement it?  Ie we do get a smaller FST when 
implementing it?
                
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 4.4
>
>         Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a 
> delta codec file for scanning to terms. Some environments have enough memory 
> available to keep the entire FST based term dict in memory. We should add a 
> TermDictionary implementation that encodes all needed information for each 
> term into the FST (custom fst.Output) and builds a FST from the entire term 
> not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

Reply via email to