On Wednesday 24 December 2008 17:51:04, robert engels wrote:
> Thinking about this some more, you could use fixed length pages for
> the term index, with a page header containing a count of entries, and
> use key compression (to avoid the constant entry size).
>
> The problem with this is that you still have to decode the entries
> (slowing the processing - since a simple binary search on the page is
> not possible).
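To make the decoding problem concrete, here is a rough sketch of such a key-compressed page (names and layout are hypothetical, not actual Lucene code). Each entry stores only the suffix that differs from the previous term, so entries are variable length and entry N is meaningless without entry N-1 - which is exactly why a binary search inside the page is impossible and a linear decode is forced:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class CompressedTermPage {

    /** Encode sorted terms with shared-prefix (key) compression. */
    static ByteBuffer encode(List<String> sortedTerms) {
        ByteBuffer buf = ByteBuffer.allocate(1 << 16);
        buf.putInt(sortedTerms.size());            // page header: entry count
        byte[] prev = new byte[0];
        for (String term : sortedTerms) {
            byte[] cur = term.getBytes(StandardCharsets.UTF_8);
            int shared = 0;
            while (shared < prev.length && shared < cur.length
                    && prev[shared] == cur[shared]) {
                shared++;
            }
            buf.putShort((short) shared);           // bytes shared with previous term
            buf.putShort((short) (cur.length - shared));
            buf.put(cur, shared, cur.length - shared);
            prev = cur;
        }
        buf.flip();
        return buf;
    }

    /** Linear scan: the only way to look up a term in a compressed page. */
    static boolean contains(ByteBuffer page, String target) {
        ByteBuffer buf = page.duplicate();
        int count = buf.getInt();
        byte[] prev = new byte[0];
        for (int i = 0; i < count; i++) {
            int shared = buf.getShort();
            int suffixLen = buf.getShort();
            byte[] cur = new byte[shared + suffixLen];
            // Reconstruct the full term from the previous term's prefix.
            System.arraycopy(prev, 0, cur, 0, shared);
            buf.get(cur, shared, suffixLen);
            if (new String(cur, StandardCharsets.UTF_8).equals(target)) {
                return true;
            }
            prev = cur;
        }
        return false;
    }
}
```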
The cache between the pages and the CPU is also a bottleneck nowadays.
See here:

Super-Scalar RAM-CPU Cache Compression
M. Zukowski, S. Heman, N. Nes, P. Boncz - cwi.nl

Currently available from this link:
http://www.cwi.nl/htbin/ins1/publications?request=pdfgz&key=ZuHeNeBo:ICDE:06

Also, some preliminary results on Lucene indexes are available at
LUCENE-1410.

Regards,
Paul Elschot

> But, if you also add a 'least term and greatest term' to the page
> header (you can avoid the duplicate storage of these entries as
> well), you can perform a binary search of the term index much faster.
> You only need to decode the index page containing (maybe) the desired
> entry.
>
> If you were doing a prefix/range search, you will still end up
> decoding lots of pages...
>
> This is why a database has its own page cache, and usually caches
> the decoded form (for index pages) for faster processing - at the
> expense of higher memory usage. Usually data pages are not cached in
> the decoded/uncompressed form. In most cases the database vendor will
> recommend removing the OS page cache on the database server and
> allocating all of the memory to the database process.
>
> You may be able to avoid some of the warm-up of an index using memory
> mapped files, but with proper ordering of the writing of the index,
> it probably isn't necessary. Beyond that, processing the term index
> directly using NIO does not appear that it will be faster than using
> an in-process cache of the term index (similar to the skip-to memory
> index now).
>
> The BEST approach is probably to have the index writer build the
> memory "skip to" structure as it writes the segment, and then include
> this in the segment during the reopen - no warming required! As
> long as the reader and writer are in the same process, it will be a
> winner!
>
> On Dec 23, 2008, at 11:02 PM, robert engels wrote:
> > Seems doubtful you will be able to do this without increasing the
> > index size dramatically.
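The 'least term / greatest term' page header idea above could be sketched like this (hypothetical names, not Lucene code). Only the two boundary terms per page are examined during the search, so the page bodies stay compressed and at most one page has to be decoded per lookup:

```java
import java.util.List;

public class PageDirectory {

    /** Per-page header: boundary terms plus the page's offset on disk. */
    static final class PageHeader {
        final String leastTerm;
        final String greatestTerm;
        final long offset;
        PageHeader(String least, String greatest, long offset) {
            this.leastTerm = least;
            this.greatestTerm = greatest;
            this.offset = offset;
        }
    }

    /**
     * Binary search over page headers; returns the offset of the single
     * page that may contain the term, or -1 if no page can contain it.
     */
    static long findCandidatePage(List<PageHeader> headers, String term) {
        int lo = 0, hi = headers.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            PageHeader h = headers.get(mid);
            if (term.compareTo(h.leastTerm) < 0) {
                hi = mid - 1;
            } else if (term.compareTo(h.greatestTerm) > 0) {
                lo = mid + 1;
            } else {
                return h.offset;    // only this page needs decoding
            }
        }
        return -1;                  // falls between pages: not in the index
    }
}
```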
> > Since it will need to be stored "unpacked" (in order to have random
> > access), yet the terms are variable length - leading to using a
> > fixed size (maximum = minimum) for every term.
> >
> > In the end I highly doubt it will make much difference in speed -
> > here are several reasons why...
> >
> > 1. With "fixed" size terms, the additional IO (larger pages)
> > probably offsets a lot of the random access benefit. This is why
> > "compressed" disks on a fast machine (CPU) are often faster than
> > "uncompressed" - more data is read during every IO access.
> >
> > 2. With a reopen, only new segments are "read", and since it is a
> > new segment, it is most likely already in the disk cache (from the
> > write), so the reopen penalty is negligible (especially if the term
> > index "skip to" is written last).
> >
> > 3. If the reopen is after an optimize - when the OS cache has
> > probably been obliterated - then the warm-up time is going to be
> > similar in most cases anyway, since the "index" pages will also not
> > be in core (in the case of memory mapped files). Again, writing the
> > "skip to" last can help with this.
> >
> > Just because a file is memory mapped does not mean its pages will
> > have a greater likelihood of being in the cache. The locality of
> > reference is going to control this, just as the most/often accessed
> > data controls it in the OS disk cache. Also, most OSs will take real
> > memory from the virtual address space and add it to the disk cache
> > if the process is doing lots of IO.
> >
> > If you have a memory mapped "term index", you are still going to
> > need to perform a binary search to find the correct term "page",
> > and after an optimize the visited pages will not be in the cache
> > (or in core).
> >
> > On Dec 23, 2008, at 9:20 PM, Marvin Humphrey wrote:
> >> On Tue, Dec 23, 2008 at 08:36:24PM -0600, robert engels wrote:
> >>> Is there something that I am missing?
> >>
> >> Yes.
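For reference, a memory mapped term index with fixed-size "unpacked" entries, as described above, might look roughly like this (ENTRY_SIZE and all names are hypothetical). Padding every entry to the maximum size is what makes a plain binary search possible - at the cost robert describes, and on a cold cache each probe can still fault in a page:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedTermIndex {
    static final int ENTRY_SIZE = 64;   // maximum term length, hypothetical

    private final ByteBuffer buf;
    private final int count;

    MappedTermIndex(ByteBuffer buf) {
        this.buf = buf;
        this.count = buf.capacity() / ENTRY_SIZE;
    }

    /** Map the term-index file; the OS page cache backs the buffer. */
    static MappedTermIndex open(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            return new MappedTermIndex(
                ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()));
        }
    }

    /** Read the fixed-size entry at index i, stripping zero padding. */
    private String termAt(int i) {
        byte[] raw = new byte[ENTRY_SIZE];
        ByteBuffer dup = buf.duplicate();
        dup.position(i * ENTRY_SIZE);
        dup.get(raw);
        int len = 0;
        while (len < ENTRY_SIZE && raw[len] != 0) len++;
        return new String(raw, 0, len, StandardCharsets.UTF_8);
    }

    /** Plain binary search over the fixed-size entries. */
    int find(String term) {
        int lo = 0, hi = count - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = termAt(mid).compareTo(term);
            if (cmp < 0) lo = mid + 1;
            else if (cmp > 0) hi = mid - 1;
            else return mid;
        }
        return -1;
    }

    /** Build a padded buffer from sorted terms (for illustration). */
    static ByteBuffer pack(String... sortedTerms) {
        ByteBuffer b = ByteBuffer.allocate(sortedTerms.length * ENTRY_SIZE);
        for (int i = 0; i < sortedTerms.length; i++) {
            byte[] t = sortedTerms[i].getBytes(StandardCharsets.UTF_8);
            ByteBuffer dup = b.duplicate();
            dup.position(i * ENTRY_SIZE);
            dup.put(t);                 // rest of the slot stays zero = padding
        }
        return b;
    }
}
```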
> >>
> >>> I see lots of references to using "memory mapped" files to
> >>> "dramatically" improve performance.
> >>
> >> There have been substantial discussions about this design in JIRA,
> >> notably LUCENE-1458.
> >>
> >> The "dramatic" improvement is WRT opening/reopening an IndexReader.
> >> Presently in both KS and Lucene, certain data structures have to
> >> be read at IndexReader startup and unpacked into process memory --
> >> in particular, the term dictionary index and sort caches. If those
> >> data structures can be represented by a memory mapped file rather
> >> than built up from scratch, we save big.
> >>
> >> Marvin Humphrey

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org