As I pointed out in another email, I understand the benefits of compression (compressed disks vs. uncompressed, etc.). PFOR is definitely a winner!

As I understood this discussion though, it was an attempt to remove the in memory 'skip to' index, to avoid the reading of this during index open/reopen.

I was attempting to point out that this in-memory index is still needed, but there are ways to improve the current process.

I don't think a mapped file for the term index is going to work, for a variety of reasons. Mapped files are designed as a programming simplification - mainly for older systems that use line-delimited files - so that you don't have to build your own "page/section" caches when processing very large files of which only a small portion is used at any given time (i.e. the data visible on the screen). When you end up visiting a large portion of the file anyway (to do a full repagination), an in-process intelligent cache is going to be far superior.

My review of the Java Buffer related classes does not give me the impression it is going to be faster - in fact it will be slower - than a single copy into user space, and processing/decompressing there. The Buffer system is suitable when you perform little inspection of the data and then copy it directly to another buffer (think reading from a file and sending out on a socket). If you end up inspecting the buffer, it is going to be very slow.
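A minimal sketch of the single-copy approach described above (all class and method names are hypothetical; the variable-byte encoding mirrors Lucene's VInt format): copy the buffer contents into a heap array once, then do all the byte-level inspection in plain user-space memory rather than through per-element Buffer calls.

```java
import java.nio.ByteBuffer;

public class BufferDecode {
    // Decode VInt-style values by first copying the mapped/direct buffer
    // into a heap byte[] (one bulk copy into user space), then doing all
    // subsequent inspection against the plain array.
    static int[] decodeViaBulkCopy(ByteBuffer buf, int count) {
        byte[] local = new byte[buf.remaining()];
        buf.duplicate().get(local);          // single bulk copy
        int[] out = new int[count];
        int pos = 0;
        for (int i = 0; i < count; i++) {
            int b = local[pos++] & 0xFF;
            int value = b & 0x7F;
            int shift = 7;
            while ((b & 0x80) != 0) {        // continuation bit set
                b = local[pos++] & 0xFF;
                value |= (b & 0x7F) << shift;
                shift += 7;
            }
            out[i] = value;
        }
        return out;
    }
}
```

The alternative - calling `get()` on the ByteBuffer for every byte inspected - pays the Buffer bounds-checking and indirection cost on each access, which is the slowness being described.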

On Dec 24, 2008, at 11:33 AM, Paul Elschot wrote:


On Wednesday, 24 December 2008 17:51:04, robert engels wrote:
Thinking about this some more, you could use fixed length pages for
the term index, with a page header containing a count of entries, and
use key compression (to avoid the constant entry size).

The problem with this is that you still have to decode the entries
(slowing the processing - since a simple binary search on the page is
not possible).
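A sketch of why binary search fails inside such a page (hypothetical layout and names): with key (prefix) compression, each entry stores only how much it shares with the previous key plus a suffix, so each key can only be reconstructed after decoding the one before it - the scan is inherently sequential.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class PrefixPage {
    // Page layout: a one-byte count header, then prefix-compressed keys,
    // each stored as (sharedPrefixLength, suffixLength, suffixBytes).
    // Each key depends on the previous key, so the page must be decoded
    // front to back -- no in-page binary search is possible.
    static List<String> decodePage(byte[] page) {
        int count = page[0] & 0xFF;          // count header
        int pos = 1;
        List<String> keys = new ArrayList<>();
        String prev = "";
        for (int i = 0; i < count; i++) {
            int shared = page[pos++] & 0xFF;
            int sufLen = page[pos++] & 0xFF;
            String suffix = new String(page, pos, sufLen, StandardCharsets.UTF_8);
            pos += sufLen;
            prev = prev.substring(0, shared) + suffix;
            keys.add(prev);
        }
        return keys;
    }
}
```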

The cache between the pages and the cpu is also a bottleneck nowadays.
See here:

Super-Scalar RAM-CPU Cache Compression
M Zukowski, S Heman, N Nes, P Boncz - cwi.nl

currently available from this link:

http://www.cwi.nl/htbin/ins1/publications?request=pdfgz&key=ZuHeNeBo:ICDE:06

Also, some preliminary results on lucene indexes
are available at LUCENE-1410.

Regards,
Paul Elschot



But, if you also add a 'least term and greatest term' to the page
header (you can avoid the duplicate storage of these entries as
well), you can perform a binary search of the term index much faster.
You only need to decode the index page containing (maybe) the desired
entry.
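A sketch of that search, assuming the least/greatest terms from each page header are held in two parallel arrays (a hypothetical structure, not an existing Lucene API): a binary search over the headers alone pins down the single page that could contain the term, and only that page needs decoding.

```java
public class PageIndexSearch {
    // least[i]/greatest[i] are the smallest and largest terms stored in
    // page i (taken from the page headers). Returns the index of the one
    // page that might contain the term, or -1 if the term falls in a gap
    // between pages and therefore cannot exist in the index.
    static int findPage(String[] least, String[] greatest, String term) {
        int lo = 0, hi = least.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (term.compareTo(least[mid]) < 0) {
                hi = mid - 1;
            } else if (term.compareTo(greatest[mid]) > 0) {
                lo = mid + 1;
            } else {
                return mid;   // only this page could hold the term
            }
        }
        return -1;            // between pages: decode nothing at all
    }
}
```

Note the second payoff: when the term falls between two pages, the search proves absence without decoding any page at all.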

If you were doing a prefix/range search, you will still end up
decoding lots of pages...

This is why databases have their own page cache, and usually cache the decoded form (for index pages) for faster processing - at the expense of higher memory usage. Data pages are usually not cached in the decoded/uncompressed form. In most cases the database vendor will recommend bypassing the OS page cache on the database server and allocating all of the memory to the database process.

You may be able to avoid some of the warm-up of an index using memory mapped files, but with proper ordering of the writing of the index, it probably isn't necessary. Beyond that, processing the term index directly using NIO does not appear likely to be faster than using an in-process cache of the term index (similar to the skip-to memory index now).

The BEST approach is probably to have the index writer build the in-memory "skip to" structure as it writes the segment, and then include this in the segment during the reopen - no warming required! As long as the reader and writer are in the same process, it will be a winner!
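A sketch of that hand-off, assuming reader and writer share a process (class and method names are hypothetical): the writer registers the "skip to" structure it already built while flushing a segment, and a reopening reader simply reuses it instead of re-reading and unpacking it from disk.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the writer publishes the in-memory term-index
// structure it built during the segment flush; a reader opened in the
// same process acquires it directly -- no warm-up read required.
public class SharedTermIndex {
    private final Map<String, long[]> bySegment = new ConcurrentHashMap<>();

    // Writer side: register the structure built while writing the segment.
    void publish(String segment, long[] termIndexOffsets) {
        bySegment.put(segment, termIndexOffsets);
    }

    // Reader side: reuse the writer's structure on reopen; null means the
    // segment was written by another process and must be read from disk.
    long[] acquire(String segment) {
        return bySegment.get(segment);
    }
}
```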

On Dec 23, 2008, at 11:02 PM, robert engels wrote:
Seems doubtful you will be able to do this without increasing the
index size dramatically, since it will need to be stored
"unpacked" (in order to have random access) even though the terms are
variable length - forcing every term to occupy a fixed slot sized to
the maximum term length.
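A sketch of what that "unpacked" layout implies (hypothetical names and slot size): random access only works if slot i starts at i * SLOT, so every term must be padded to a fixed width - which is exactly the size inflation being objected to.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class FixedSlotSearch {
    // Assumed fixed slot width: every term is padded (here, with spaces)
    // to the maximum term length so that slot i begins at i * SLOT.
    // Variable-length entries would break this arithmetic, which is why
    // the file must store every term at maximum size.
    static final int SLOT = 8;

    static int binarySearch(ByteBuffer index, int count, String term) {
        byte[] slot = new byte[SLOT];
        int lo = 0, hi = count - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            ByteBuffer dup = index.duplicate();
            dup.position(mid * SLOT);        // random access via fixed width
            dup.get(slot, 0, SLOT);
            String key = new String(slot, StandardCharsets.UTF_8).trim();
            int cmp = term.compareTo(key);
            if (cmp == 0) return mid;
            if (cmp < 0) hi = mid - 1; else lo = mid + 1;
        }
        return -1;
    }
}
```

With an average term much shorter than the maximum, most of each slot is padding - hence the dramatic growth in index size.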

In the end I highly doubt it will make much difference in speed -
here's several reasons why...

1. with "fixed" size terms, the additional IO (larger pages)
probably offsets a lot of the random access benefit. This is why
"compressed" disks on a fast machine (CPU) are often faster than
"uncompressed" - more data is read during every IO access.

2. with a reopen, only new segments are "read", and since it is a
new segment, it is most likely already in the disk cache (from the
write), so the reopen penalty is negligible (especially if the term
index "skip to" is written last).

3. If the reopen is after an optimize - when the OS cache has
probably been obliterated, then the warm up time is going to be
similar in most cases anyway, since the "index" pages will also not
be in core (in the case of memory mapped files). Again, writing the
"skip to" last can help with this.

Just because a file is memory mapped does not mean its pages will
have a greater likelihood of being in the cache. Locality of
reference is going to control this, just as frequency and recency of
access control it in the OS disk cache. Also, most OSs will take real
memory from the virtual address space and add it to the disk cache
if the process is doing lots of IO.

If you have a memory mapped "term index", you are still going to
need to perform a binary search to find the correct term "page",
and after an optimize the visited pages will not be in the cache
(or in core).

On Dec 23, 2008, at 9:20 PM, Marvin Humphrey wrote:
On Tue, Dec 23, 2008 at 08:36:24PM -0600, robert engels wrote:
Is there something that I am missing?

Yes.

I see lots of references to using "memory mapped" files to
"dramatically" improve performance.

There have been substantial discussions about this design in JIRA,
notably LUCENE-1458.

The "dramatic" improvement is WRT opening/reopening an IndexReader.
Presently in both KS and Lucene, certain data structures have to be
read at IndexReader startup and unpacked into process memory -- in
particular, the term dictionary index and sort caches. If those data
structures can be represented by a memory mapped file rather than
built up from scratch, we save big.

Marvin Humphrey


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

