What's the bottleneck for the slow searching? I'm monitoring it and it's doing about 57% CPU load when I'm searching. It takes about 50 seconds to bring up the results page the first time; if I then search for the same thing again it's much faster.

Doug, can I trash my segments after they are indexed? I don't want cached access to the pages, so do the segments still need to be there? My 30-million-page index/segments are using over 300GB. I have the space for now, but when I get to hundreds of millions of pages I will run out of room on my RAID controllers for HD expansion. I'm planning on moving to Lustre if NDFS is not stable by then. I plan on having a multi-billion-page index, if the memory requirements for that can be kept below 16GB per search node.

Right now I'm getting pretty poor results from my 30 million pages. I read the paper "Authoritative Sources in a Hyperlinked Environment" because someone said that's how the Nutch algorithm works, so I'm assuming that as my index grows, the pages that deserve top placement will receive top placement. But I don't know if I should re-fetch a new set of segments with root URLs ending only in US extensions (.com, .edu, etc.). I made a small set testing this theory (100,000 pages) and its results were much better than my results from the 30-million-page index. What's your thought on this? Am I right in thinking that the pages with the most pages linking to them will show up first, so that if I index 500 million pages my results should be on par with the rest of the "big dogs"?
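For the US-extensions test I restricted the crawl with the regex URL filter. A rough sketch of what I mean, assuming the 0.6-era filter file format (one +/- pattern per line, first match wins) in conf/regex-urlfilter.txt, or conf/crawl-urlfilter.txt if you use the crawl command; double-check the file name against your install:

    # accept only hosts ending in these extensions (my test list)
    +^http://([a-z0-9-]+\.)+(com|edu|org|net|gov|mil)/
    # skip everything else
    -.

That's the whole experiment: everything outside those extensions gets dropped at fetch-list generation time, so the 100,000-page test set contained only those hosts.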
One last important question: if I merge my indexes, will searching be faster than if I don't merge them? I currently have 20 directories of 1-1.7 million pages each. And if I split these indexes across multiple machines, will the searching be faster? I couldn't get the nutch-server to work, but I'm using 0.6 (see my sketch of the distributed setup after the quoted message below).

I have a very fast server, but I didn't know if the searching would take advantage of SMP; fetching will, and I can run multiple indexing jobs at the same time. My HD array does 200MB/sec I/O. I have the new dual-core Opteron 275 (Italy core) with 4GB of RAM, working my way up to 16GB and a second processor when I need them, and currently 1.28TB of HD space for Nutch, with expansion up to 5.12TB. I'm running Windows 2000 on it, as they haven't made a SUSE 9.3 driver yet for my RAID cards (HighPoint 2220). So my ceiling is 960MB/sec with all the drives in the system and 4 x 2.2GHz processor cores. Until I need to cluster, that's what I have to play with for Nutch, in case you guys needed to know what hardware I'm running.

Thank you,
-Jay Pound
Fromped.com

BTW, Windows 2000 is not 100% stable with dual-core processors. Nutch is OK, but I can't do too many things at once or I'll get a kernel inpage error (guess it's time to migrate to 2003 .NET Server, damn).

----- Original Message -----
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: <nutch-dev@lucene.apache.org>
Sent: Tuesday, August 02, 2005 1:53 PM
Subject: Re: Memory usage

> Try the following settings in your nutch-site.xml:
>
> <property>
>   <name>io.map.index.skip</name>
>   <value>7</value>
> </property>
>
> <property>
>   <name>indexer.termIndexInterval</name>
>   <value>1024</value>
> </property>
>
> The first causes data files to use considerably less memory.
>
> The second affects index creation, so must be done before you create the
> index you search. It's okay if your segment indexes were created
> without this; you can just (re-)merge indexes and the merged index will
> get the setting and use less memory when searching.
>
> Combining these two I have searched a 40+M page index on a machine using
> about 500MB of RAM. That said, search times with such a large index are
> not good. At some point, as your collection grows, you will want to
> merge multiple indexes containing different subsets of segments, put
> each on a separate box, and search them with distributed search.
>
> Doug
>
> Jay Pound wrote:
> > I'm testing an index of 30 million pages; it requires 1.5GB of RAM to
> > search using Tomcat 5. I plan on having an index with multiple billion
> > pages, but if this is how it scales, then even with 16GB of RAM I won't
> > be able to have an index larger than 320 million pages. How can I
> > distribute the memory requirements across multiple machines? Or is
> > there another servlet container (like Resin) that requires less memory
> > to operate? Has anyone else run into this?
> > Thanks,
> > -Jay Pound
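P.S. Doug, so I understand the distributed search you mention before I try the server again: my reading is that each box runs a search server over its slice of the index, and the web app points searcher.dir at a directory holding a search-servers.txt instead of a local index. A minimal sketch of what I have in mind (hostnames, ports, and paths are my placeholders, and the command/file names are from the docs as I understand them, so correct me if 0.6 differs):

On each search node, something like:

    bin/nutch server 9999 /data/nutch/part-01

Then on the front-end box, in nutch-site.xml:

    <property>
      <name>searcher.dir</name>
      <value>/data/nutch/distributed</value>
    </property>

with /data/nutch/distributed/search-servers.txt listing one "host port" pair per line:

    node1 9999
    node2 9999

Is that the right wiring, and is the IndexMerger tool what you mean by "(re-)merge indexes"?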