On Wednesday 24 December 2008 17:51:04, robert engels wrote:
> Thinking about this some more, you could use fixed length pages for
> the term index, with a page header containing a count of entries, and
> use key compression (to avoid the constant entry size).
>
> The problem with this is that you still have to decode the entries
> (slowing the processing - since a simple binary search on the page is
> not possible).
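To make the decoding problem concrete, here is a rough sketch of such a key-compressed page (names and layout are hypothetical, not actual Lucene code). Each entry stores only the suffix that differs from the previous term, so entries are variable length and entry N is meaningless without entry N-1 - which is exactly why a binary search inside the page is impossible and a linear decode is forced:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class CompressedTermPage {

    /** Encode sorted terms with shared-prefix (key) compression. */
    static ByteBuffer encode(List<String> sortedTerms) {
        ByteBuffer buf = ByteBuffer.allocate(1 << 16);
        buf.putInt(sortedTerms.size());            // page header: entry count
        byte[] prev = new byte[0];
        for (String term : sortedTerms) {
            byte[] cur = term.getBytes(StandardCharsets.UTF_8);
            int shared = 0;
            while (shared < prev.length && shared < cur.length
                    && prev[shared] == cur[shared]) {
                shared++;
            }
            buf.putShort((short) shared);           // bytes shared with previous term
            buf.putShort((short) (cur.length - shared));
            buf.put(cur, shared, cur.length - shared);
            prev = cur;
        }
        buf.flip();
        return buf;
    }

    /** Linear scan: the only way to look up a term in a compressed page. */
    static boolean contains(ByteBuffer page, String target) {
        ByteBuffer buf = page.duplicate();
        int count = buf.getInt();
        byte[] prev = new byte[0];
        for (int i = 0; i < count; i++) {
            int shared = buf.getShort();
            int suffixLen = buf.getShort();
            byte[] cur = new byte[shared + suffixLen];
            // Reconstruct the full term from the previous term's prefix.
            System.arraycopy(prev, 0, cur, 0, shared);
            buf.get(cur, shared, suffixLen);
            if (new String(cur, StandardCharsets.UTF_8).equals(target)) {
                return true;
            }
            prev = cur;
        }
        return false;
    }
}
```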
The cache between the pages and the CPU is also a bottleneck nowadays.
See here:

Super-Scalar RAM-CPU Cache Compression
M. Zukowski, S. Heman, N. Nes, P. Boncz - cwi.nl

Currently available from this link:
http://www.cwi.nl/htbin/ins1/publications?request=pdfgz&key=ZuHeNeBo:ICDE:06

Also, some preliminary results on Lucene indexes are available at
LUCENE-1410.

Regards,
Paul Elschot

> But, if you also add a 'least term and greatest term' to the page
> header (you can avoid the duplicate storage of these entries as
> well), you can perform a binary search of the term index much faster.
> You only need to decode the index page containing (maybe) the desired
> entry.
>
> If you were doing a prefix/range search, you will still end up
> decoding lots of pages...
>
> This is why a database has its own page cache, and usually caches
> the decoded form (for index pages) for faster processing - at the
> expense of higher memory usage. Usually data pages are not cached in
> the decoded/uncompressed form. In most cases the database vendor will
> recommend removing the OS page cache on the database server and
> allocating all of the memory to the database process.
>
> You may be able to avoid some of the warm-up of an index using memory
> mapped files, but with proper ordering of the writing of the index,
> it probably isn't necessary. Beyond that, processing the term index
> directly using NIO does not appear that it will be faster than using
> an in-process cache of the term index (similar to the skip-to memory
> index now).
>
> The BEST approach is probably to have the index writer build the
> memory "skip to" structure as it writes the segment, and then include
> this in the segment during the reopen - no warming required! As
> long as the reader and writer are in the same process, it will be a
> winner!
>
> On Dec 23, 2008, at 11:02 PM, robert engels wrote:
> > Seems doubtful you will be able to do this without increasing the
> > index size dramatically.
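The 'least term / greatest term' page header idea above could be sketched like this (hypothetical names, not Lucene code). Only the two boundary terms per page are examined during the search, so the page bodies stay compressed and at most one page has to be decoded per lookup:

```java
import java.util.List;

public class PageDirectory {

    /** Per-page header: boundary terms plus the page's offset on disk. */
    static final class PageHeader {
        final String leastTerm;
        final String greatestTerm;
        final long offset;
        PageHeader(String least, String greatest, long offset) {
            this.leastTerm = least;
            this.greatestTerm = greatest;
            this.offset = offset;
        }
    }

    /**
     * Binary search over page headers; returns the offset of the single
     * page that may contain the term, or -1 if no page can contain it.
     */
    static long findCandidatePage(List<PageHeader> headers, String term) {
        int lo = 0, hi = headers.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            PageHeader h = headers.get(mid);
            if (term.compareTo(h.leastTerm) < 0) {
                hi = mid - 1;
            } else if (term.compareTo(h.greatestTerm) > 0) {
                lo = mid + 1;
            } else {
                return h.offset;    // only this page needs decoding
            }
        }
        return -1;                  // falls between pages: not in the index
    }
}
```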
> > Since it will need to be stored "unpacked" (in order to have random
> > access), yet the terms are variable length - leading to using a
> > fixed size (maximum = minimum) for every term.
> >
> > In the end I highly doubt it will make much difference in speed -
> > here are several reasons why...
> >
> > 1. With "fixed" size terms, the additional IO (larger pages)
> > probably offsets a lot of the random access benefit. This is why
> > "compressed" disks on a fast machine (CPU) are often faster than
> > "uncompressed" - more data is read during every IO access.
> >
> > 2. With a reopen, only new segments are "read", and since it is a
> > new segment, it is most likely already in the disk cache (from the
> > write), so the reopen penalty is negligible (especially if the term
> > index "skip to" is written last).
> >
> > 3. If the reopen is after an optimize - when the OS cache has
> > probably been obliterated - then the warm-up time is going to be
> > similar in most cases anyway, since the "index" pages will also not
> > be in core (in the case of memory mapped files). Again, writing the
> > "skip to" last can help with this.
> >
> > Just because a file is memory mapped does not mean its pages will
> > have a greater likelihood of being in the cache. The locality of
> > reference is going to control this, just as the most/often accessed
> > data controls it in the OS disk cache. Also, most OSs will take real
> > memory from the virtual address space and add it to the disk cache
> > if the process is doing lots of IO.
> >
> > If you have a memory mapped "term index", you are still going to
> > need to perform a binary search to find the correct term "page",
> > and after an optimize the visited pages will not be in the cache
> > (or in core).
> >
> > On Dec 23, 2008, at 9:20 PM, Marvin Humphrey wrote:
> >> On Tue, Dec 23, 2008 at 08:36:24PM -0600, robert engels wrote:
> >>> Is there something that I am missing?
> >>
> >> Yes.
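For reference, a memory mapped term index with fixed-size "unpacked" entries, as described above, might look roughly like this (ENTRY_SIZE and all names are hypothetical). Padding every entry to the maximum size is what makes a plain binary search possible - at the cost robert describes, and on a cold cache each probe can still fault in a page:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedTermIndex {
    static final int ENTRY_SIZE = 64;   // maximum term length, hypothetical

    private final ByteBuffer buf;
    private final int count;

    MappedTermIndex(ByteBuffer buf) {
        this.buf = buf;
        this.count = buf.capacity() / ENTRY_SIZE;
    }

    /** Map the term-index file; the OS page cache backs the buffer. */
    static MappedTermIndex open(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            return new MappedTermIndex(
                ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()));
        }
    }

    /** Read the fixed-size entry at index i, stripping zero padding. */
    private String termAt(int i) {
        byte[] raw = new byte[ENTRY_SIZE];
        ByteBuffer dup = buf.duplicate();
        dup.position(i * ENTRY_SIZE);
        dup.get(raw);
        int len = 0;
        while (len < ENTRY_SIZE && raw[len] != 0) len++;
        return new String(raw, 0, len, StandardCharsets.UTF_8);
    }

    /** Plain binary search over the fixed-size entries. */
    int find(String term) {
        int lo = 0, hi = count - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = termAt(mid).compareTo(term);
            if (cmp < 0) lo = mid + 1;
            else if (cmp > 0) hi = mid - 1;
            else return mid;
        }
        return -1;
    }

    /** Build a padded buffer from sorted terms (for illustration). */
    static ByteBuffer pack(String... sortedTerms) {
        ByteBuffer b = ByteBuffer.allocate(sortedTerms.length * ENTRY_SIZE);
        for (int i = 0; i < sortedTerms.length; i++) {
            byte[] t = sortedTerms[i].getBytes(StandardCharsets.UTF_8);
            ByteBuffer dup = b.duplicate();
            dup.position(i * ENTRY_SIZE);
            dup.put(t);                 // rest of the slot stays zero = padding
        }
        return b;
    }
}
```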
> >>
> >>> I see lots of references to using "memory mapped" files to
> >>> "dramatically" improve performance.
> >>
> >> There have been substantial discussions about this design in JIRA,
> >> notably LUCENE-1458.
> >>
> >> The "dramatic" improvement is WRT opening/reopening an IndexReader.
> >> Presently in both KS and Lucene, certain data structures have to
> >> be read at IndexReader startup and unpacked into process memory --
> >> in particular, the term dictionary index and sort caches. If those
> >> data structures can be represented by a memory mapped file rather
> >> than built up from scratch, we save big.
> >>
> >> Marvin Humphrey

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org