Inefficient/useless UUIDDocId cache ( was : Search performance : MultiIndex)

Ard Schrijvers Fri, 09 Nov 2007 03:45:42 -0800

Hello,

> > Christoph Kiehl wrote:
> > My hope is that we find exactly that kind of clever cache ;) I even 
> > think we should make this one persistent because you have to read 
> > every node to build it and I think it might consume to much 
> memory for 
> > big repositories (at least if we keep the paths human readable like 
> > above).
>


I have been testing a little more. I have not been able yet to improve
the speed of the first call to DescendantSelfAxisQuery (I hope to find
some time for it soon) but I did find a large improvement for subsequent
queries (only measurable for repository >100.000 nodes), because of an
IMHO very inefficient caching of UUIDDocId which is used by the *highly*
expensive 

CachingIndexReader.getParent(int n, BitSet deleted)

It took me a while to see what was going wrong, but for subsequent
queries with DescendantSelfAxisQuery, a few went fast, and then, pretty
arbitrary, one subsequent exact same query took 10-25 times longer. The
problem is in the fact that 

DocId.UUIDDocId caches its documentnumber in
getDocumentNumber(IndexReader reader) with a WeakReference to reader
like 

private Reference reader;
this.reader = new WeakReference(reader);

In getDocumentNumber, at
 
synchronized (this) {
  if (this.reader != null && reader.equals(this.reader.get())) {
     return docNumber;
}

it is fast if and only if this.reader != null &&
reader.equals(this.reader.get()) returns true. 

But, the reader object which is passed into
getDocumentNumber(IndexReader reader) is recreated in
SearchIndex.getIndexReader for every query again (AFAIU)! Even though
the reader.equals(this.reader.get()) keeps being valid, because the
hashcode and equals of the newly created reader matches the previous
one, the this.reader is null (after random gc()) because the garbage
collector was able to clear *all* the UUIDDocId.reader references, and
thus, the highly expensible getDocumentNumber needs to run again.! 

In other words, the CachingIndexReader.getParent almost always has an
expensive 

parent = DocId.create(parentUUID);

A very esay solution to this problem, is to have the
SearchIndex.getInderReader have returned the same combinedIndexReader
instance (for example a static one which gets cleared when a index
changed) while all indexes are unchanged. Just returning the same
instance solves the problem (I also suppose that originally it will have
worked like this but ofcourse, when using WeakReferences it is very hard
to keep this all in mind)

A further memory optimalisation is that I think it is redundant in
UUIDDocId to store the 'private String uuid'. The number of cached
UUIDDocId can become *very* large, and storing for example 1.000.000
UUIDDocId will, if each uuid for example is around 100 bit, 100 Mb. 

When my change with respect to SearchIndex.getInderReader is done, then
in principle we might put the uuid in UUIDDocId to null after the
docNumber has been computed. This is because when any index changes, a
new CachingIndexReader must be created, hence also private final DocId[]
parents will be recreated. To, I must admit it is error prone to set the
uuid to null, when for some reason 

this.reader != null && reader.equals(this.reader.get()) returns false
while not a new CachingIndexReader was created (which in principle
should be impossible)

Because you might easily end up with  NPEs because of this, we still
might store the uuid, but at least only store the  UUID.lsb and
UUID.msb. This saves a lot a memory, and might actually be pretty
signifcant when the number of nodes grow. 

Also regarding Christoph's suggestion for a smart hierarchical clever
cache....this getParent in combination with the UUIDDocId is actually
exactly a hierarchical cache. It only didn't really work because of
gc()'s. Still, we have the problem of slow first queries, or after a
changed index, but we could think of pre-caching for the getParent or
keeping the cached UUID in sync with a changing index (though this is
almost impossible because lucene does not have static document
positions)

WDOT? I hope I have been able to desribe the problem well, because it
always gets pretty hard when WeakReferences are involved (even face2face
on a whiteboard). It is hard to show the behaviour with a profiler or
with debugging. The only real way to see the problem is to have a
*large* repository, and run 

result = q.execute();
result = q.execute();
result = q.execute();
System.gc();
result = q.execute();
System.gc();
result = q.execute();
     
for a query involving DescendantSelfAxisQuery. You'll see that after
gc(), the results takes much longer (x25 i have seen)

Regards Ard

ps if clear I can make a seperate JIRA issue for this, because I think
it is to specific for 
https://issues.apache.org/jira/browse/JCR-1196



>

Inefficient/useless UUIDDocId cache ( was : Search performance : MultiIndex)

Reply via email to