[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649968#action_12649968 ]

Michael McCandless commented on LUCENE-1458:
--------------------------------------------

{quote}
Nevertheless, the terms index isn't that big in comparison to, say, the size
of a posting list for a common term, so the cost of re-heating it isn't
astronomical in the grand scheme of things.
{quote}

Be careful: it's the seeking that kills you (until we switch to SSDs,
at which point perhaps most of this discussion is moot!).  Even though
the terms index's net size is small, if re-heating the spots you touch
incurs 20 separate page misses, you lose: at roughly 10 ms per seek on
a spinning disk, that's ~200 ms gone before the query even runs.

Potentially worse than the terms index are the norms, if the search
hits a lot of docs.

{quote}
> Take a large Jira instance...

Search responsiveness is already compromised in such a situation, because we
can all but guarantee that the posting list files have already been evicted
from cache. If the box has enough RAM for the large JIRA instance including
the Lucene index, search responsiveness won't be a problem. As soon as you
start running a little short on RAM, though, there's no way to stop infrequent
searches from being sluggish.
{quote}

If the terms index and norms are pinned (or happen to still be hot), I
would expect most searches to be OK in this "in the middle" use case,
because the number of seeks you'll incur should be well contained
(assuming your posting list isn't unduly fragmented by the
filesystem).  Burning through the posting list is a linear scan.
Queries that simply hit too many docs will always be slow anyway.

I think at both extremes (way too little RAM and tons of RAM) both
approaches (pinned in RAM vs mmap'd) should perform the same.  It's
the cases in between where I think letting the VM decide whether
critical things (terms index, norms) get to stay hot is dangerous.

{quote}
The terms index could indeed get evicted some of the time on busy systems, but
the point is that the system IO cache usually works in our favor, even under
load.
{quote}

I think you're just more trusting of the IO/VM system than I am.  I
think LRU is a poor policy for deciding which pages deserve to stay
resident.

{quote}
As far as backup daemons blowing up everybody's cache, that's stupid,
pathological behavior: <http://kerneltrap.org/node/3000#comment-8573>. Such
apps ought to be calling madvise(ptr, len, MADV_SEQUENTIAL) so that the kernel
knows it can recycle the cache pages as soon as they're cleared.
{quote}

Excellent!  If only more people knew about this.  And, if only we
could do this from javaland.  EG SegmentMerger should do this for all
segment data it's reading & writing.

{quote}
Nathan Kurz and I brainstormed this subject in a phone call this morning, and
we came up with a three-file lexicon index design:
{quote}

I don't fully understand this approach.  Would the index file pointers
point into the full lexicon's packed utf8 file, or a separate "only
terms in the index" packed utf8 file?

We currently materialize individual Strings when we load our index,
which is bad because of the GC cost, the added RAM overhead (&
swapping), and because for pure ISO-8859-1 terms we use 2X the space
over UTF-8, since Java Strings are UTF-16 internally.  So I'd love to
eventually do something similar (in RAM) for Lucene.
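
Something like this, I mean (a sketch I made up, not the actual
Lucy/KS design -- class and method names are hypothetical):

{code}
// Hypothetical sketch: every index term concatenated into one UTF-8
// byte[], plus a parallel offsets array.  Two bulk-loadable arrays,
// zero per-term String objects for GC to crawl.
class PackedTermsIndex {
  private final byte[] utf8;    // all indexed terms, concatenated
  private final int[] offsets;  // offsets[i] = start of term i; length numTerms+1

  PackedTermsIndex(byte[] utf8, int[] offsets) {
    this.utf8 = utf8;
    this.offsets = offsets;
  }

  /** Index of the greatest term <= target, or -1 if target precedes all terms. */
  int floor(byte[] target) {
    int lo = 0, hi = offsets.length - 2, result = -1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (compare(mid, target) <= 0) {
        result = mid;
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return result;
  }

  // Unsigned byte-wise comparison; for UTF-8 this matches code point order.
  private int compare(int term, byte[] target) {
    int start = offsets[term];
    int termLen = offsets[term + 1] - start;
    int n = Math.min(termLen, target.length);
    for (int k = 0; k < n; k++) {
      int diff = (utf8[start + k] & 0xFF) - (target[k] & 0xFF);
      if (diff != 0) {
        return diff;
      }
    }
    return termLen - target.length;
  }
}
{code}

Lookup is a plain binary search over the offsets, and loading the
index is just two bulk array reads.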

{quote}
> Have you tried any actual tests swapping these approaches in as your
> terms index impl?

No - changing something like this requires a lot of coding, so it's better to
do thought experiments first to winnow down the options.
{quote}

Agreed.  But once you've got the mmap-based solution up and running,
it'd be nice to measure the net time spent on terms lookup / norms
reading, for a variety of search use cases, and plot that on a
histogram.

{quote}
When I mentioned this to Nate, he remarked that we're using the OS kernel like
you're using the JVM.
{quote}

True!

{quote}
Lucy/KS can't enforce that, and we wouldn't want to. It's very convenient to
be able to launch a cheap search process.
{quote}

It seems like the ability to very quickly launch brand-new searchers
has become a strong design goal of Lucy/KS.  What's the driver here?
Is it near-realtime search?  (Which I think may be better achieved by
having IndexWriter export a reader, rather than using the IO system as
the intermediary.)

If we fix the terms index to bulk-load arrays (it doesn't today) then
the cost of loading norms & terms index on instantiating a reader
should be fairly well contained, though not as near zero as Lucy/KS
will be.

{quote}
> That's a nice goal. Our biggest cost in Lucene is warming the
> FieldCache, used for sorting, function queries, etc.

Exactly. It would be nice to add a plug-in indexing component that
writes sort caches to files that can be memory mapped at IndexReader
startup. There would be multiple files: both a solid array of 32-bit
integers mapping document number to sort order, and the field cache
values. Such a component would allow us to move the time it takes to
read in a sort cache from IndexReader-startup-time to index-time.
{quote}

Except I would have IndexReader use its RAM budget to pick & choose
which of these get to be hot, and which get mmap'd.
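
EG the docID -> sort ord file could just be mapped and read lazily.  A
made-up sketch (file name and layout are hypothetical, not the
proposed component):

{code}
// Hypothetical sketch: a sort cache written at index time as a flat
// array of 32-bit ords (docID -> sort position), memory-mapped rather
// than slurped at IndexReader startup.
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;

class MappedSortCache {
  /** Maps the ords file; pages fault in lazily as searches touch them. */
  static IntBuffer openOrds(String fileName, int maxDoc) throws IOException {
    RandomAccessFile raf = new RandomAccessFile(fileName, "r");
    try {
      FileChannel channel = raf.getChannel();
      // The mapping remains valid after the file is closed.
      return channel.map(FileChannel.MapMode.READ_ONLY, 0, 4L * maxDoc)
                    .asIntBuffer();
    } finally {
      raf.close();
    }
  }
}
// usage: int ord = ords.get(docID);  // one (possibly cold) page access
{code}

Nothing is decoded up front; a page only faults in if a search
actually touches it.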

{quote}
Hmm, maybe we can conflate this with a column-stride field writer
and require that sort fields have a fixed width?
{quote}

Yes, I think the column-stride fields writer should write the docID ->
ord part of StringIndex to disk, and MultiRangeQuery in LUCENE-1461
would then use it.  With enumerated-type fields (far fewer unique
terms than docs), bit packing will make them compact.
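
To make the bit-packing point concrete (a rough sketch I made up, not
LUCENE-1461's code): with numTerms unique values, each entry needs
only ceil(log2(numTerms)) bits instead of 32.

{code}
// Hypothetical sketch of a bit-packed docID -> ord array.  Assumes
// each slot is written exactly once (entries start zeroed).
class PackedOrds {
  private final long[] blocks;
  private final int bitsPerValue;

  PackedOrds(int maxDoc, int numTerms) {
    // ceil(log2(numTerms)) bits per entry, minimum 1
    this.bitsPerValue = Math.max(1, 64 - Long.numberOfLeadingZeros(numTerms - 1));
    this.blocks = new long[(int) (((long) maxDoc * bitsPerValue + 63) / 64)];
  }

  void set(int docID, int ord) {
    long bitPos = (long) docID * bitsPerValue;
    int block = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    blocks[block] |= ((long) ord) << shift;
    if (shift + bitsPerValue > 64) {   // value straddles two longs
      blocks[block + 1] |= ((long) ord) >>> (64 - shift);
    }
  }

  int get(int docID) {
    long bitPos = (long) docID * bitsPerValue;
    int block = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    long value = blocks[block] >>> shift;
    if (shift + bitsPerValue > 64) {
      value |= blocks[block + 1] << (64 - shift);
    }
    return (int) (value & ((1L << bitsPerValue) - 1));
  }
}
{code}

EG a "country" field with 200 unique values needs 8 bits per doc
instead of 32.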

{quote}
In KS, the relevant IndexReader methods no longer take a Term
object. (In fact, there IS no Term object any more -
KinoSearch::Index::Term has been removed.) Instead, they take a
string field and a generic "Obj".
{quote}

But you must at least require these Objs to know how to compareTo one
another?  Does this mean per-field custom sort ordering (a collator,
say) is straightforward for KS?
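
EG (hypothetical names, not the KS API): once the seek is driven by a
comparator rather than a hardwired byte order, a per-field
java.text.Collator just drops in.

{code}
import java.text.Collator;
import java.util.Comparator;
import java.util.Locale;

class CollatedTermSeek {
  /** Binary search over this field's sorted terms using its comparator. */
  static int seek(String[] terms, String target, Comparator<Object> cmp) {
    int lo = 0, hi = terms.length - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      int c = cmp.compare(terms[mid], target);
      if (c < 0) {
        lo = mid + 1;
      } else if (c > 0) {
        hi = mid - 1;
      } else {
        return mid;
      }
    }
    return -(lo + 1);  // not found: insertion point, a la Arrays.binarySearch
  }

  public static void main(String[] args) {
    // terms must already be sorted per this field's collator
    String[] titleTerms = { "Mahler", "Mueller", "Müller" };
    System.out.println(
        seek(titleTerms, "Müller", Collator.getInstance(Locale.GERMAN)));
  }
}
{code}

The per-field part is then just a map from field name to comparator.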

{quote}
I suppose we genericize this by adding a TermsDictReader/LexReader
argument to the IndexReader constructor? That way, someone can
supply a custom subclass that knows how to decode custom dictionary
files.
{quote}

Right; that's what let me create the PulsingCodec here.

The biggest problem with the "load important stuff into RAM" approach,
of course, is that we can't actually pin VM pages from javaland, which
means the OS will happily swap out my RAM anyway, at which point of
course we should have used mmap.  Though apparently at least Windows
has an option to "optimize for services" (= "don't swap out my RAM", I
think) vs "optimize for applications", and Linux lets you tune
swappiness.  But both are global.
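
The closest we get from javaland today is MappedByteBuffer.load(),
which best-effort warms the mapped pages but pins nothing -- the OS
remains free to evict them later:

{code}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

class WarmNotPin {
  static MappedByteBuffer mapAndWarm(String fileName) throws IOException {
    RandomAccessFile raf = new RandomAccessFile(fileName, "r");
    try {
      FileChannel channel = raf.getChannel();
      MappedByteBuffer buf =
          channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
      buf.load();  // best-effort page-in; emphatically not an mlock()
      return buf;
    } finally {
      raf.close();
    }
  }
}
{code}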


> Further steps towards flexible indexing
> ---------------------------------------
>
>                 Key: LUCENE-1458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.9
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
>     uses tii/tis files, but the tii only stores term & long offset
>     (not a TermInfo).  At seek points, tis encodes term & freq/prox
>     offsets absolutely instead of with deltas.  Also, tis/tii
>     are structured by field, so we don't have to record field number
>     in every term.
> .
>     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
>     RAM usage when loading terms dict index is significantly less
>     since we only load an array of offsets and an array of String (no
>     more TermInfo array).  It should be faster to init too.
> .
>     This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
>     from docs/positions readers.  EG there is no more TermInfo used
>     when reading the new format.
> .
>     There's nice symmetry now between reading & writing in the codec
>     chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
>     This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
>     terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
>     This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
>     old API on top of the new API to keep back-compat.
>     
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>     fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
>     old API on top of new one, switch all core/contrib users to the
>     new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
>     DocsEnum, PostingsEnum -- this would give readers API flexibility
>     (not just index-file-format flexibility).  EG if someone wanted
>     to store payload at the term-doc level instead of
>     term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

