Dennis Kubes wrote:
>> That's a very nice description - thanks, Dennis. I think it would be
>> useful to include it on the Wiki as a case study.
>
> I will polish it up a bit and put it out there.

Great, thanks.

>>> This is all dependent on the size of each local index. Approximately
>>> 2-4M pages per index split is good. Over that you may see
>>> performance decreases. Scaling that out over many servers you will
>>> see almost linear response time. We have almost 100M pages in the
>>> index and are seeing subsecond response times on most queries.
>>
>> Are you running with a sorted index, and using non-zero
>> searcher.max.hits? If you use a well-defined PR-like scoring, then
>> using this feature could do wonders for the performance, and increase
>> the max number of docs per server.
>
> I don't know about the sorted index. How do I learn about that?
>
> We basically took the current indexer and extended it to split into
> parts. The indexer also splits the segments and linkdb into the same
> parts so all data for a single url will be in the same split on the same
> search server. We are using searcher.max.hits at 1000 and we did see a
> performance increase from that.

If you're using a non-zero searcher.max.hits with unsorted indexes, your
ranking will be broken: the code in LuceneQueryOptimizer will make wrong
assumptions when it extrapolates scores for the documents it skipped.
This feature relies strongly on the indexes being sorted by PageRank
score - see the IndexSorter tool for details. If you don't sort the
index by PageRank, you should set this property to a value <= 0.

Also try upgrading Nutch to Lucene 2.2.0; this alone should give you a
performance boost of a few percent (if Lucene is indeed the bottleneck).

See also my (long) rant about the complexity of Nutch queries:

http://www.nabble.com/Performance-optimization-for-Nutch-index---query-tf3276316.html#a9111523

-- 
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
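
To illustrate the point in the reply above about searcher.max.hits and sorted indexes, here is a minimal toy sketch. It is not the LuceneQueryOptimizer code. It assumes that every document matches the query, that ranking uses only a static PageRank-like score, and that a positive max_hits simply stops collection after that many documents in index order. The function name `search` and the sample data are hypothetical.

```python
from typing import List, Tuple

Doc = Tuple[str, float]  # (doc id, static PageRank-like score); hypothetical toy data


def search(index: List[Doc], max_hits: int, top_k: int) -> List[str]:
    """Collect docs in index order, stop after max_hits, return the best top_k ids."""
    # max_hits <= 0 means "no cutoff", mirroring the advice for unsorted indexes.
    collected = index if max_hits <= 0 else index[:max_hits]
    ranked = sorted(collected, key=lambda d: d[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]


unsorted_index = [("a", 0.1), ("b", 0.9), ("c", 0.3),
                  ("d", 0.8), ("e", 0.2), ("f", 0.7)]
# Same documents, reordered by descending static score (what a sorted index gives you).
sorted_index = sorted(unsorted_index, key=lambda d: d[1], reverse=True)

full = search(unsorted_index, max_hits=0, top_k=2)          # no cutoff: the true top 2
cut_unsorted = search(unsorted_index, max_hits=3, top_k=2)  # cutoff on an unsorted index
cut_sorted = search(sorted_index, max_hits=3, top_k=2)      # cutoff on a sorted index

print(full, cut_unsorted, cut_sorted)
```

In this toy model the cutoff on the sorted index returns the same top results as the full scan, while the cutoff on the unsorted index misses document "d" because it sits past the cutoff point. Real queries also have a query-dependent score component, so treat this only as intuition for why the cutoff assumes a sorted index.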
