Hi,

For a project I had to build an index over a local copy of the ~3 million static Wikipedia pages from 2007, and it indexed out of the box just by following the tutorial, so kudos.

Now I'm trying to speed up query performance. The easiest solution I can think of is to mmap the index file, but I have no idea how to do this. Anyone have an idea? Is there some other parameter I can tweak to load/cache the index, or some form of index primer around that will pre-cache the indexes? Currently it's about 300 ms per query (on a really high-performance Fedora box with 8 GB of RAM in the Amazon compute cloud), and the index is less than 5 GB, so it ought to fit in memory.
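
The closest I've come is Lucene's pluggable FSDirectory mechanism. Am I on the right track? Below is a minimal sketch of what I mean, assuming the Lucene bundled with Nutch 0.9 includes MMapDirectory and honours the org.apache.lucene.FSDirectory.class system property (the index path is made up):

    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MMapCheck {
        public static void main(String[] args) throws Exception {
            // This has to run before any FSDirectory is opened; in practice
            // I'd pass it as a JVM flag to the search webapp instead:
            // -Dorg.apache.lucene.FSDirectory.class=org.apache.lucene.store.MMapDirectory
            System.setProperty("org.apache.lucene.FSDirectory.class",
                               "org.apache.lucene.store.MMapDirectory");

            // "/data/crawl/index" is just a placeholder for my index dir.
            Directory dir = FSDirectory.getDirectory("/data/crawl/index", false);
            System.out.println(dir.getClass().getName()); // hoping for MMapDirectory
            dir.close();
        }
    }

Failing that, I suppose I could just warm the OS page cache by cat'ing the index files to /dev/null before the searcher starts, but mmap seems cleaner.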

The other question I have is regarding anchor text and link analysis. The site is just a directory hierarchy, and I crawled it using 'file:///' URLs. Do I need to do an http:// crawl to get anchor text to work, or can I just run a partial rebuild on the segments? Does 0.9 have an approximation of PageRank, and if so, does it work on file: URLs that all share the same host?
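
For what it's worth, my current guess from reading nutch-default.xml is that db.ignore.internal.links is the culprit: with a file:/// crawl every link is "internal" to the one host, so the anchor text never makes it into the linkdb. If that's right, would overriding it in conf/nutch-site.xml and then re-running invertlinks plus a re-index be enough? Something like this (unverified on my part):

    <!-- conf/nutch-site.xml: keep links between pages on the same host,
         so anchors from a single-host file:/// crawl survive. My guess,
         not yet tested. -->
    <property>
      <name>db.ignore.internal.links</name>
      <value>false</value>
    </property>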

Sorry to bug you guys, but I can't find anything on the wiki that's really helpful, nor has anyone on Nutch User been able to supply an answer on these two topics.


Cheers,
 Winton