Seems Doug is correct. I ran our tests through the profiler. Most of the time is spent in reading/parsing SegmentTermDocs (see the very interesting attached profiler output).
I was amazed at how much time is spent in both readVInt() and readByte(). It seems high, but I think it is mainly due to the number of invocations.

1) What if BufferedIndexInput had an optimized version of readVInt() that used the buffer and manipulated the position directly?

2) Instead of caching the TermInfo, what if the TermDocs were cached (again, for the top 20% of terms)? The memory requirement would be much greater, but you could also say "do not cache the TermDocs that have more than X documents". The optimized searcher already converts TermQueries similar to this to a Filter anyway.

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, May 22, 2006 11:33 AM
To: java-dev@lucene.apache.org
Subject: Re: caching term information?

Marvin Humphrey wrote:
> On May 20, 2006, at 12:01 AM, Robert Engels wrote:
>
>> Maybe don't cache the term pages, then, just cache the frequently
>> requested terms themselves.
>
> That sounds like a winner. Search term frequencies follow a power law
> distribution. Cache the top 20% or so in an LRU and you'll cut down on
> disk seeks and linear scanning significantly.

Keep in mind that the .tis file is compressed: it uses far less memory per term than a TermInfo does. So, to minimize disk i/o, one should leave things compressed and cache portions of the .tis file instead. The OS's buffer cache should do this well for you. But if the system call overhead is causing significant delay, then the .tis file could be memory mapped. And if constructing and scanning TermInfos is the primary delay, then, of course, a cache of TermInfos might be indicated.

In summary, there are lots of possible places to optimize here, but it's not clear which, if any, are warranted. Folks have benchmarked a TermInfo cache before and not found it advantageous. But perhaps your uses are sufficiently different that this is no longer the case.
Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
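To make the caching idea from this thread concrete (an LRU over the most frequently requested terms, plus the "do not cache terms with more than X documents" cutoff), here is a minimal sketch. It is not Lucene code: the class name TermDocsCache, the use of a plain int[] of document numbers as the cached value, and both constructor parameters are assumptions made for illustration. It relies only on java.util.LinkedHashMap in access order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache of per-term document lists; not part of Lucene.
final class TermDocsCache {
    private final int maxDocsPerEntry;
    private final LinkedHashMap<String, int[]> cache;

    TermDocsCache(final int maxEntries, int maxDocsPerEntry) {
        this.maxDocsPerEntry = maxDocsPerEntry;
        // accessOrder = true: iteration order runs from least to most recently used.
        this.cache = new LinkedHashMap<String, int[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, int[]> eldest) {
                // Evict the least recently used entry once the cap is exceeded.
                return size() > maxEntries;
            }
        };
    }

    // Returns the cached doc list, or null on a miss. A hit counts as a use.
    synchronized int[] get(String term) {
        return cache.get(term);
    }

    // Caches the doc list unless it exceeds the per-entry cutoff.
    // Returns true if the entry was stored.
    synchronized boolean put(String term, int[] docs) {
        if (docs.length > maxDocsPerEntry) {
            return false; // too many documents: leave this term uncached
        }
        cache.put(term, docs);
        return true;
    }

    synchronized int size() {
        return cache.size();
    }
}
```

Whether such a cache pays off is exactly the open question above: it trades memory for fewer reads, so it would need to be benchmarked against simply letting the OS buffer cache hold the compressed file.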
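As a rough illustration of suggestion 1), here is a sketch of decoding a VInt straight out of a byte buffer while tracking the position in a local variable, instead of calling readByte() once per byte. It follows the VInt layout (seven data bits per byte, low-order group first, high bit set when more bytes follow). The class BufferedVIntReader is hypothetical and is not the BufferedIndexInput API, and it assumes the whole value is already in the buffer; a real implementation would have to fall back to the slow path when a value straddles a buffer refill.

```java
// Hypothetical reader; not the actual BufferedIndexInput implementation.
final class BufferedVIntReader {
    private final byte[] buffer;
    private int position;

    BufferedVIntReader(byte[] buffer) {
        this.buffer = buffer;
    }

    // Decodes one VInt directly from the buffer.
    // Assumes every byte of the value is already present in the buffer.
    int readVInt() {
        int pos = position;              // work on a local copy of the position
        byte b = buffer[pos++];
        int value = b & 0x7F;            // low seven bits of the first byte
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buffer[pos++];           // high bit was set: another byte follows
            value |= (b & 0x7F) << shift;
        }
        position = pos;                  // publish the new position once
        return value;
    }

    int position() {
        return position;
    }

    public static void main(String[] args) {
        // 300 encodes as 0xAC 0x02; 5 encodes as the single byte 0x05.
        BufferedVIntReader in =
                new BufferedVIntReader(new byte[] {(byte) 0xAC, 0x02, 0x05});
        System.out.println(in.readVInt()); // 300
        System.out.println(in.readVInt()); // 5
    }
}
```

The potential saving is in avoiding a method call and a bounds-and-refill check per byte; whether the JIT already inlines readByte() well enough to make this moot is something only the profiler can answer.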