Follow up: it seems that my mergeFactor was way too high (100), resulting in many indexes. Setting it to 5 seems to improve trivial searches a lot! At the same time, my maxMergeDocs was too small, which also resulted in many indexes.
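As a rough illustration of why both settings matter, the following Python sketch models a tiered merge scheme. It is a simplified model under stated assumptions, not Jackrabbit's or Lucene's actual merge code: every flush is assumed to create a new index of min_merge_docs documents, merge_factor indexes of equal size are merged into one, and a merge is skipped when the result would exceed max_merge_docs. The function name and the min_merge_docs value of 100 are hypothetical choices for this example.

```python
from collections import Counter


def simulate_index_sizes(total_docs, merge_factor, max_merge_docs, min_merge_docs=100):
    """Return the sorted sizes (in docs) of the indexes left after adding total_docs.

    Simplified model (an assumption, not the real merge code):
    - every flush creates one index holding min_merge_docs documents;
    - when merge_factor indexes of the same size exist, they are merged into one;
    - a merge is skipped if the merged index would exceed max_merge_docs.
    """
    sizes = Counter()  # index size -> number of indexes of that size
    for _ in range(total_docs // min_merge_docs):
        size = min_merge_docs
        sizes[size] += 1
        # Cascade merges upwards, like carrying digits in base merge_factor.
        while sizes[size] >= merge_factor and size * merge_factor <= max_merge_docs:
            sizes[size] -= merge_factor
            size *= merge_factor
            sizes[size] += 1
    return sorted(s for s, n in sizes.items() for _ in range(n))


if __name__ == "__main__":
    docs = 99_900
    print(len(simulate_index_sizes(docs, 100, 1_000_000)))  # 108 indexes
    print(len(simulate_index_sizes(docs, 5, 1_000_000)))    # 15 indexes
    print(len(simulate_index_sizes(docs, 5, 1_000)))        # 203 indexes
```

In this model, a high mergeFactor leaves up to mergeFactor - 1 small indexes per level, and a low maxMergeDocs stops the cascade early, so small indexes pile up. Both effects increase the number of indexes a query has to visit.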
Is there any golden rule on how to tune these settings? Would mergeFactor 5 and maxMergeDocs 1,000,000 be a proper setting to keep the number of indexes small? I suppose a high mergeFactor (100+) would improve writing speed but decrease search performance, wouldn't it?

Regards Ard

> Hello,
>
> I have some problems with the performance of searches in Jackrabbit. I have a simple search, like: give me all nodes where (prop1=a +prop2=b +prop3=c + prop4=d). For Lucene this is obviously an extremely simple query. Running it on a Lucene index with millions of docs, where the number of hits is small (< 100), results in queries that execute in a couple of ms.
>
> When I run this kind of query in Jackrabbit, with for example 100,000 nodes, the search described above (repeated) results in *slow* responses (a couple of hundred ms for only 100,000 nodes). I asked on the Lucene list what impact a MultiSearcher has on performance compared to a single index (I know we use a CombinedIndexReader and a normal IndexSearcher, though I am quite convinced the problem stays the same). I got only one answer, which said that a search over, say, 100 indexes would take 100 times longer (which is roughly what I am experiencing when the number of actual hits is small).
>
> I wrote a separate program to do some testing, like merging the Jackrabbit indexes into one single index. Then my queries are fast. The original reason for multiple indexes is, I think, to be able to keep more IndexReaders open and cache the results, and to have easier/faster incremental updating, right? Also see [1]. The thread in [2] between Christoph and Marcel might also be closely related to this (I did not test RangeQueries, but intuitively they will suffer even more from multiple indexes, because each index has to expand the RangeQuery separately, I think).
> The problem with the slow DescendantSelfAxisWeight won't be solved, though I did make some changes in our code to determine quickly whether a node is a child of some node or not (if people are interested: I have been thinking about this one, and it is a trade-off between fast renaming of a node in Jackrabbit and fast searching for child nodes, i.e. write versus read).
>
> Before I try to see what can be changed: do other people experience the same thing? Might it be something that was faster at the time of Lucene 1.9, but is now perhaps outdated?
>
> I also found some indications that file system access for multiple indexes is slower, because head movements during reading might be much larger compared to a single index (though this might of course be platform dependent, depending on how the file system cache is managed).
>
> To start with, I have tried to keep the number of indexes created as small as possible by tuning minMergeDocs, volatileIdleTime, maxMergeDocs and mergeFactor. However, whenever my number of documents/nodes grows (though to only 100,000 nodes), my number of indexes grows as well.
>
> I think the idea of separate indexes is perfectly valid, only I want to reduce the number of indexes to no more than, for example, 10. Adding each VolatileIndex, when persisting it, to an already persistent index until that index contains, for example, 100,000 docs, then merging them all when there are 10 of them, and then starting to create indexes of 1,000,000 docs until there are 10 again, would perhaps give the best of both worlds.
>
> WDYT? Do other people experience the same problems? I do not know how other people use Jackrabbit, but the way I want to use it consists mainly of searching. Almost everything I do is a search.
> Building a website with Jackrabbit as content store results in queries all over the place, where currently some are IMHO too slow, and where some aren't even possible within reasonable time scales (like: give me the 10 most recent articles in /content/en/news//[EMAIL PROTECTED]'news'], because this will result in a ChildAxisQuery or DescendantSelfAxisQuery, which cannot be done over millions of documents AFAICS. To solve this in my setup, I chose to index the path of a document, while I do realize that moving a node now becomes expensive with regard to re-indexing).
>
> Hope to hear what you think about it,
>
> Regards Ard
>
> [1] http://jackrabbit.apache.org/doc/arch/operate/query.html#Query
> [2] http://www.mail-archive.com/dev@jackrabbit.apache.org/msg06026.html
>
> --
>
> Hippo
> Oosteinde 11
> 1017WT Amsterdam
> The Netherlands
> Tel +31 (0)20 5224466
> -------------------------------------------------------------
> [EMAIL PROTECTED] / [EMAIL PROTECTED] / http://www.hippo.nl
> --------------------------------------------------------------