Hello, > > Christoph Kiehl wrote: > > My hope is that we find exactly that kind of clever cache ;) I even > > think we should make this one persistent because you have to read > > every node to build it and I think it might consume to much > memory for > > big repositories (at least if we keep the paths human readable like > > above). >
I have been testing a little more. I have not been able yet to improve the speed of the first call to DescendantSelfAxisQuery (I hope to find some time for it soon) but I did find a large improvement for subsequent queries (only measurable for repository >100.000 nodes), because of an IMHO very inefficient caching of UUIDDocId which is used by the *highly* expensive CachingIndexReader.getParent(int n, BitSet deleted) It took me a while to see what was going wrong, but for subsequent queries with DescendantSelfAxisQuery, a few went fast, and then, pretty arbitrary, one subsequent exact same query took 10-25 times longer. The problem is in the fact that DocId.UUIDDocId caches its documentnumber in getDocumentNumber(IndexReader reader) with a WeakReference to reader like private Reference reader; this.reader = new WeakReference(reader); In getDocumentNumber, at synchronized (this) { if (this.reader != null && reader.equals(this.reader.get())) { return docNumber; } it is fast if and only if this.reader != null && reader.equals(this.reader.get()) returns true. But, the reader object which is passed into getDocumentNumber(IndexReader reader) is recreated in SearchIndex.getIndexReader for every query again (AFAIU)! Even though the reader.equals(this.reader.get()) keeps being valid, because the hashcode and equals of the newly created reader matches the previous one, the this.reader is null (after random gc()) because the garbage collector was able to clear *all* the UUIDDocId.reader references, and thus, the highly expensible getDocumentNumber needs to run again.! In other words, the CachingIndexReader.getParent almost always has an expensive parent = DocId.create(parentUUID); A very esay solution to this problem, is to have the SearchIndex.getInderReader have returned the same combinedIndexReader instance (for example a static one which gets cleared when a index changed) while all indexes are unchanged. Just returning the same instance solves the problem (I also suppose that originally it will have worked like this but ofcourse, when using WeakReferences it is very hard to keep this all in mind) A further memory optimalisation is that I think it is redundant in UUIDDocId to store the 'private String uuid'. The number of cached UUIDDocId can become *very* large, and storing for example 1.000.000 UUIDDocId will, if each uuid for example is around 100 bit, 100 Mb. When my change with respect to SearchIndex.getInderReader is done, then in principle we might put the uuid in UUIDDocId to null after the docNumber has been computed. This is because when any index changes, a new CachingIndexReader must be created, hence also private final DocId[] parents will be recreated. To, I must admit it is error prone to set the uuid to null, when for some reason this.reader != null && reader.equals(this.reader.get()) returns false while not a new CachingIndexReader was created (which in principle should be impossible) Because you might easily end up with NPEs because of this, we still might store the uuid, but at least only store the UUID.lsb and UUID.msb. This saves a lot a memory, and might actually be pretty signifcant when the number of nodes grow. Also regarding Christoph's suggestion for a smart hierarchical clever cache....this getParent in combination with the UUIDDocId is actually exactly a hierarchical cache. It only didn't really work because of gc()'s. Still, we have the problem of slow first queries, or after a changed index, but we could think of pre-caching for the getParent or keeping the cached UUID in sync with a changing index (though this is almost impossible because lucene does not have static document positions) WDOT? I hope I have been able to desribe the problem well, because it always gets pretty hard when WeakReferences are involved (even face2face on a whiteboard). It is hard to show the behaviour with a profiler or with debugging. The only real way to see the problem is to have a *large* repository, and run result = q.execute(); result = q.execute(); result = q.execute(); System.gc(); result = q.execute(); System.gc(); result = q.execute(); for a query involving DescendantSelfAxisQuery. You'll see that after gc(), the results takes much longer (x25 i have seen) Regards Ard ps if clear I can make a seperate JIRA issue for this, because I think it is to specific for https://issues.apache.org/jira/browse/JCR-1196 >