[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

Han Jiang (JIRA) Fri, 06 Sep 2013 05:05:21 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760160#comment-13760160
 ]


Han Jiang commented on LUCENE-3069:
-----------------------------------

Mike, thanks for the review!

bq. In general, couldn't the writer re-use the reader's TermState?

I'm afraid this would make the code somewhat longer. I'll put together a
patch to see how it looks.

{quote}
Have you run "first do no harm" perf tests? Ie, compare current trunk
w/ default Codec to branch w/ default Codec? Just to make sure there
are no surprises...
{quote}

Yes, no surprises so far.

bq. Why does Lucene41PostingsWriter have "impersonation" code? 

Yeah, these should be removed.

{quote}
I forget: why does the postings reader/writer need to handle delta
coding again (take an absolute boolean argument)? Was it because of
pulsing or sep? It's fine for now (progress not perfection) ... but
not clean, since "delta coding" is really an encoding detail so in
theory the terms dict should "own" that ...
{quote}

Ah, yes, because of pulsing.

This is because PulsingPostingsBase is more than a PostingsBaseFormat.
It acts somewhat like a term dict itself: it needs to understand how
terms are structured within one block (term No.1 uses an absolute value,
term No.x uses a delta value), and then decide how to restructure the
inlined and wrapped block (No.1 still uses an absolute value, but the
first non-pulsed term needs absolute encoding as well).

Without the 'absolute' argument, the real term dictionary would do the
delta encoding itself. PulsingPostingsBase would then be confused, and
every wrapped PostingsBase would have to encode its metadata values
without the delta format.
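
To make the point about the 'absolute' flag concrete, here is a rough toy
model of what I mean. It is only a sketch, not code from the patch: the
names PulsingDeltaSketch, WrappedEncoder, encodeTerm and encodeBlock are
made up for illustration and are not Lucene APIs, and the "metadata" is
reduced to a single file pointer per term.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these names are Lucene APIs.
public class PulsingDeltaSketch {

    // Stand-in for a wrapped postings writer that encodes one file pointer
    // per term, either as an absolute value or as a delta from the previous
    // term it encoded.
    static final class WrappedEncoder {
        private long lastFp = 0;

        String encodeTerm(long fp, boolean absolute) {
            if (absolute) {
                lastFp = 0; // reset the delta base at an absolute term
            }
            String encoded = (absolute ? "abs:" : "delta:") + (fp - lastFp);
            lastFp = fp;
            return encoded;
        }
    }

    // Stand-in for the pulsing layer: pulsed terms are inlined, and the first
    // term that actually reaches the wrapped encoder in a block is forced to
    // absolute, even when it is not the first term of the block.
    public static List<String> encodeBlock(long[] fps, boolean[] pulsed) {
        WrappedEncoder wrapped = new WrappedEncoder();
        List<String> out = new ArrayList<>();
        boolean wrappedNeedsAbsolute = true;
        for (int i = 0; i < fps.length; i++) {
            if (pulsed[i]) {
                out.add("inline"); // postings are inlined, no file pointer
            } else {
                out.add(wrapped.encodeTerm(fps[i], wrappedNeedsAbsolute));
                wrappedNeedsAbsolute = false;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[] fps = {0, 100, 0, 130};
        boolean[] pulsed = {true, false, true, false};
        // prints [inline, abs:100, inline, delta:30]
        System.out.println(encodeBlock(fps, pulsed));
    }
}
```

In this toy block, term No.1 is pulsed, so the second term is the first
one the wrapped encoder sees and it has to be absolute. If the term dict
did the delta coding on its own, the wrapped encoder would have no way to
know that.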



{quote}
The new .smy file for Pulsing is sort of strange ... but necessary
since it always uses 0 longs, so we have to store this somewhere
... you could put it into FieldInfo attributes instead?
{quote}

Yeah, it is another hairy thing... the reason is that we don't have a
'PostingsTrailer' for PostingsBaseFormat. Pulsing does not know the longs
size for each field until all the fields are consumed, and it should not
write those longsSize values to termsOut in close(), since the term
dictionary uses the DirTrailer hack there. (Maybe every term dictionary
should close postingsWriter first, then write the field summary and close
itself? I'm not sure, though.)


bq. Should we backport this to 4.x? 

Yeah, OK!
                
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 5.0, 4.5
>
>         Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, LUCENE-3069.patch, 
> LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a 
> delta codec file for scanning to terms. Some environments have enough memory 
> available to keep the entire FST based term dict in memory. We should add a 
> TermDictionary implementation that encodes all needed information for each 
> term into the FST (custom fst.Output) and builds a FST from the entire term 
> not just the delta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
