RE: Search performance : MultiIndex

Ard Schrijvers Fri, 26 Oct 2007 05:46:28 -0700

Follow up: it seems that my mergeFactor was way too high (100), resulting
in many indexes. Setting it to 5 seems to improve trivial searches a
lot! At the same time, my maxMergeDocs was too small, resulting in many
indexes as well.


Is there any golden rule on how to tune these settings? Would mergeFactor
5 and maxMergeDocs 1.000.000 be a proper setting to keep the number of
indexes small? I suppose a high mergeFactor (100+) would improve writing
speed but decrease search performance, wouldn't it?
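
To make the question concrete, here is a rough sketch of how the two
settings interact. It is a toy model under stated assumptions, not
Jackrabbit's or Lucene's actual merge code: the class MergeModel, the
method indexCount and the flushSize of 100 (a hypothetical stand-in for
minMergeDocs) are made up for illustration. It assumes a new index is
written every flushSize docs and that mergeFactor indexes of equal size
are merged into one, unless the result would exceed maxMergeDocs.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of tiered index merging; NOT Jackrabbit's or Lucene's real merge code. */
final class MergeModel {

    /**
     * Returns how many on-disk indexes remain after adding totalDocs documents.
     * Assumptions: every flush writes one index of flushSize docs, and
     * mergeFactor indexes of equal size are merged into one as long as the
     * merged index would not exceed maxMergeDocs.
     */
    static int indexCount(long totalDocs, long flushSize, int mergeFactor, long maxMergeDocs) {
        if (flushSize < 1 || mergeFactor < 2) {
            throw new IllegalArgumentException("flushSize must be >= 1 and mergeFactor >= 2");
        }
        // tiers.get(k) = number of indexes holding flushSize * mergeFactor^k docs
        List<Integer> tiers = new ArrayList<>();
        long flushes = totalDocs / flushSize; // a partial last flush is ignored
        for (long f = 0; f < flushes; f++) {
            increment(tiers, 0);
            int k = 0;
            long mergedSize = flushSize * mergeFactor; // size of one index in tier k + 1
            // Cascade merges upwards while a tier is full and the result stays within the limit.
            while (k < tiers.size() && tiers.get(k) >= mergeFactor && mergedSize <= maxMergeDocs) {
                tiers.set(k, tiers.get(k) - mergeFactor);
                increment(tiers, k + 1);
                k++;
                mergedSize *= mergeFactor;
            }
        }
        int count = 0;
        for (int n : tiers) {
            count += n;
        }
        return count;
    }

    private static void increment(List<Integer> tiers, int k) {
        while (tiers.size() <= k) {
            tiers.add(0);
        }
        tiers.set(k, tiers.get(k) + 1);
    }

    public static void main(String[] args) {
        long docs = 99_900;   // just under 100.000 nodes
        long flushSize = 100; // hypothetical stand-in for minMergeDocs
        System.out.println("mergeFactor=100, maxMergeDocs=1000000: "
                + indexCount(docs, flushSize, 100, 1_000_000));
        System.out.println("mergeFactor=5,   maxMergeDocs=1000000: "
                + indexCount(docs, flushSize, 5, 1_000_000));
        System.out.println("mergeFactor=5,   maxMergeDocs=1000:    "
                + indexCount(docs, flushSize, 5, 1_000));
    }
}
```

In this model, 99.900 docs leave 108 indexes with mergeFactor 100, 15
with mergeFactor 5, and 203 with mergeFactor 5 but maxMergeDocs 1.000.
That is consistent with the observation above that both a high
mergeFactor and a low maxMergeDocs lead to many indexes; the real
numbers depend on the actual merge implementation.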

Regards Ard

> Hello,
> 
> I have some problems with the performance of searches in
> Jackrabbit. I have a simple search like: give me all nodes
> where (prop1=a +prop2=b +prop3=c + prop4=d). For Lucene this is
> obviously an extremely simple query. Running it on a Lucene index
> with millions of docs, where the number of hits is small (< 100),
> results in queries executed in a couple of ms.
> 
> When I run this kind of query in Jackrabbit, with for example
> 100.000 nodes, and repeat the search described above, I get
> *slow* responses (a couple of hundred ms for only 100.000
> nodes). I asked on the Lucene list what impact a
> MultiSearcher (I know we use a CombinedIndexReader and a
> normal IndexSearcher, though I am quite convinced the problem
> stays the same) has on performance compared to a single
> index. I got only one answer: a search over, say, 100 indexes
> would take 100 times longer (which is roughly what I am
> experiencing when the number of actual hits is small).
> 
> I wrote a separate program to do some testing, such as merging
> the Jackrabbit indexes into one single index. Then my
> queries are fast. The original reason for multiple indexes is,
> I think, to be able to keep more IndexReaders open and cache
> the results, and to have easier/faster incremental updating,
> right? Also see [1]. The thread in [2] between Christoph
> and Marcel might also be closely related to this (I did not
> test RangeQueries, but intuitively they will suffer even more
> from multiple indexes, because I think each index has to expand
> the RangeQuery separately). The problem with the slow
> DescendantSelfAxisWeight won't be solved by this, though I made
> some changes in our code to find out quickly whether a node is
> a child of some node or not (if people are interested: I have
> been thinking about this one, and it is a trade-off between
> fast renaming of a node in Jackrabbit and fast searching for
> child nodes (write versus read)).
> 
> Before I try to see what can be changed: do other people
> experience the same thing?
> Might it be something that was faster at the time of Lucene
> 1.9, but is now perhaps outdated?
> 
> I also found some indications that file system access for
> multiple indexes is slower, because head movements during reading
> might be much larger compared to a single index (though how the
> file system cache is managed might of course be platform dependent).
> 
> To start with, I have tried to keep the number of indexes
> created as small as possible by tuning minMergeDocs,
> volatileIdleTime, maxMergeDocs and mergeFactor. Whenever my
> number of documents/nodes grows, however (though to only 100.000
> nodes), my number of indexes grows too.
> 
> I think the idea of separate indexes is perfectly valid,
> only I want to reduce the number of indexes to no more than,
> for example, 10. Adding each VolatileIndex, when persisting it,
> to an already persistent index until, for example, that index
> contains 100.000 docs, and then, when there are 10 of them,
> merging them all and starting to create indexes of 1.000.000 docs
> until there are 10 of those, would perhaps give the best of both
> worlds.
> 
> WDOT? Do other people experience the same problems? I do not
> know how other people use Jackrabbit, but the way I want to
> use it consists mainly of searching. Almost everything I do
> is a search. Building a website with Jackrabbit as content
> store results in queries all over the place, where currently
> some are IMHO too slow, and where some aren't even possible
> within reasonable time scales (like: give me the 10 most recent
> articles in /content/en/news//[EMAIL PROTECTED]'news'], because this
> will result in a ChildAxisQuery or DescendantSelfAxisQuery,
> which cannot be done over millions of documents AFAICS. To
> solve this in my setup, I chose to index the path of a
> document, although I realize that moving a node now becomes
> expensive in terms of re-indexing).
> 
> Hope to hear what you think about it,
> 
> Regards Ard
> 
> [1] http://jackrabbit.apache.org/doc/arch/operate/query.html#Query
> [2] http://www.mail-archive.com/dev@jackrabbit.apache.org/msg06026.html
> 
> -- 
> 
> Hippo
> Oosteinde 11
> 1017WT Amsterdam
> The Netherlands
> Tel  +31 (0)20 5224466
> -------------------------------------------------------------
> [EMAIL PROTECTED] / [EMAIL PROTECTED] / http://www.hippo.nl
> -------------------------------------------------------------- 
>
