[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

Earwin Burrfoot (JIRA) Sun, 09 Jan 2011 07:27:08 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979346#action_12979346
 ]


Earwin Burrfoot commented on LUCENE-2843:
-----------------------------------------

bq. I don't like the reasoning that, just because sphinx does it and their 
'users manage', that makes it ok.
I'm in no way advocating it as an all-round better solution. It has its 
wrinkles, just like anything else.
My reasoning is merely that an alternative exists, and it is viable, as proven by 
pretty high-profile users.
They have a memory-resident term dictionary, and it works; I have never heard any 
complaints about it.

bq. sphinx also requires mysql
Have you read anything at all? It has a ready-made MySQL integration, for the layman 
user who just wants to stick full-text search into their little app, but it is in 
no way reliant on it.
Sphinx is a direct alternative to Solr.

{quote}
But, I'm not a fan of pure disk-based terms dict. Expecting the OS to make good 
decisions on what gets swapped out is risky - Lucene is better informed than 
the OS on which data structures are worth spending RAM on (norms, terms index, 
field cache, del docs).
If indeed the terms dict (thanks to FSTs) becomes small enough to "fit" in RAM, 
then we should load it into RAM (and do away w/ the terms index).
{quote}
That's a bit delusional. If a system is forced to swap out, it is just as likely 
to swap out your explicitly managed RAM as memory-mapped files. I've seen this 
countless times.
And with memory-mapped files you get a number of benefits: sharing the filesystem 
cache when the same file is opened multiple times, offloading things from the Java 
heap (which is almost always a good thing), and the fastest possible 
load-into-memory times.
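
To make the memory-mapping point concrete, here is a minimal sketch in plain Java NIO. It is not Lucene code: the class name MmapSketch, the helper mapReadOnly and the temp file contents are made up for illustration. It maps the same file twice; both buffers are backed by the OS page cache rather than by the Java heap.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration class, not part of Lucene.
class MmapSketch {

    // Maps the whole file read-only. The mapped bytes are served from the
    // OS page cache; nothing is copied onto the Java heap.
    public static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // A mapping remains valid after its channel has been closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed stand-in for a terms dictionary file.
        Path file = Files.createTempFile("terms", ".dict");
        file.toFile().deleteOnExit();
        Files.write(file, "apple\nbanana\ncherry\n".getBytes(StandardCharsets.UTF_8));

        // Two independent mappings of the same file share the same cached pages.
        MappedByteBuffer first = mapReadOnly(file);
        MappedByteBuffer second = mapReadOnly(file);

        System.out.println("direct (off-heap): " + first.isDirect());
        System.out.println("mapped bytes: " + first.remaining());
        System.out.println("first byte via second mapping: " + (char) second.get(0));
    }
}
```

Opening the mapping is cheap because no data is read up front; pages are faulted in on first access, which is where the fast load-into-memory times come from.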


Sorry if I sound offensive at times, but, damn, there's a whole world of 
simple and efficient code lying ahead in that direction :)

> Add variable-gap terms index impl.
> ----------------------------------
>
>                 Key: LUCENE-2843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>         Attachments: LUCENE-2843.patch, LUCENE-2843.patch
>
>
> PrefixCodedTermsReader/Writer (used by all "real" core codecs) already
> supports pluggable terms index impls.
> The only impl we have now is FixedGapTermsIndexReader/Writer, which
> picks every Nth (default 32) term and holds it in efficient packed
> int/byte arrays in RAM.  This is already an enormous improvement (RAM
> reduction, init time) over 3.x.
> This patch adds another impl, VariableGapTermsIndexReader/Writer,
> which lets you specify an arbitrary IndexTermSelector to pick which
> terms are indexed, and then uses an FST to hold the indexed terms.
> This is typically even more memory efficient than packed int/byte
> arrays, though, it does not support ord() so it's not quite a fair
> comparison.
> I had to relax the terms index plugin api for
> PrefixCodedTermsReader/Writer to not assume that the terms index impl
> supports ord.
> I also did some cleanup of the FST/FSTEnum APIs and impls, and broke
> out separate seekCeil and seekFloor in FSTEnum.  Eg we need seekFloor
> when the FST is used as a terms index but seekCeil when it's holding
> all terms in the index (ie which SimpleText uses FSTs for).
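
The fixed-gap idea and the seekFloor/seekCeil distinction described above can be sketched roughly as follows. This is a toy model, not the patch: FixedGapIndexSketch and its methods are hypothetical, a TreeMap stands in for the packed arrays or FST, a List stands in for the on-disk terms dict, and String.compareTo is used instead of Lucene's byte-wise term order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class, not Lucene's terms index API.
class FixedGapIndexSketch {
    // Stand-in for the full, sorted terms dictionary (on disk in practice).
    private final List<String> terms;
    // In-RAM index: every gap-th term mapped to its position in the dictionary.
    private final TreeMap<String, Integer> index = new TreeMap<>();

    public FixedGapIndexSketch(List<String> sortedTerms, int gap) {
        this.terms = sortedTerms;
        for (int i = 0; i < sortedTerms.size(); i += gap) {
            index.put(sortedTerms.get(i), i);
        }
    }

    // Terms-index lookup: position of the greatest indexed term <= target,
    // which is where a scan of the dictionary has to start.
    public int seekFloor(String target) {
        Map.Entry<String, Integer> e = index.floorEntry(target);
        return e == null ? 0 : e.getValue();
    }

    // Full-dictionary lookup: smallest term >= target, or null if none.
    // Scans forward from the position the index points at.
    public String seekCeil(String target) {
        for (int i = seekFloor(target); i < terms.size(); i++) {
            if (terms.get(i).compareTo(target) >= 0) {
                return terms.get(i);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
            "apple", "banana", "cherry", "date", "elder", "fig", "grape");
        // gap = 3 indexes "apple", "date" and "grape" only.
        FixedGapIndexSketch idx = new FixedGapIndexSketch(terms, 3);
        System.out.println(idx.seekFloor("egg"));
        System.out.println(idx.seekCeil("egg"));
    }
}
```

A variable-gap variant would only change the constructor: instead of taking every gap-th term, it would ask a selector which terms to put into the index.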

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


