Hi,
I had to build an index of a copy of the 3 million static
Wikipedia pages from 2007 for a project, and it indexed out of the
box just following the tutorial, so kudos.
However, I'm trying to speed up query performance, and the
easiest solution I can think of is to mmap the index file, though I
have no idea how to do this. Does anyone have an idea? Is there some
other parameter I can tweak to load/cache the index? Is there some
form of index primer around that will pre-cache the indexes?
Currently it's about 300 ms per query (on a really high-performance
Fedora box with 8 GB of RAM in the Amazon compute cloud). The index
is less than 5 GB.
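By "pre-cache" I just mean forcing the index files through the OS
page cache before the first query hits, so reads come from RAM
instead of disk. A crude sketch of what I have in mind (the index
path is a placeholder, not my actual layout):

```shell
# Read every file in the index directory once so the kernel's page
# cache holds them; subsequent searcher reads should then avoid disk.
# INDEX_DIR is a placeholder -- point it at the real Nutch index dir.
INDEX_DIR=${INDEX_DIR:-./crawl/index}
if [ -d "$INDEX_DIR" ]; then
    find "$INDEX_DIR" -type f -exec cat {} + > /dev/null
    echo "primed"
fi
```

Obviously this only helps while the files stay resident, and with
8 GB of RAM against a <5 GB index that should be the case, but if
there's a proper mmap or warm-up mechanism I'd rather use that.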
The other question I have is regarding anchor text and link analysis.
The site is just a directory hierarchy, and I crawled it using
'file:///' URLs - do I need to do an http:// crawl for anchor text
to work, or can I just run a partial rebuild on the segments? Does
0.9 have an approximation of PageRank, and if so, does it work on
file: URLs with the same host?
Sorry to bug you guys, but I can't find anything on the wiki that's
really helpful, nor can anyone on the Nutch user list supply an
answer to these two topics.
Cheers,
Winton