Hi,
I had to build an index of a copy of the 3 million static
Wikipedia pages from 2007 for a project, and it indexed out of the
box just following the tutorial, so kudos.
However, I'm trying to speed up query performance, and the
easiest solution I can think of is to mmap the index file, though I
have no idea how to do this. Does anyone have an idea? Is there some
other parameter I can tweak to load/cache the index? Is there some
form of index primer around that will pre-cache the indexes?
Currently it's about 300 ms per query (on a really high-performance
Fedora box with 8 GB of RAM in the Amazon compute cloud). The index
is less than 5 GB.
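By "pre-cache" I just mean forcing the index files through the OS
page cache before the first query hits, so reads come from RAM
instead of disk. A crude sketch of what I have in mind (the index
path is a placeholder, not my actual layout):

```shell
# Read every file in the index directory once so the kernel's page
# cache holds them; subsequent searcher reads should then avoid disk.
# INDEX_DIR is a placeholder -- point it at the real Nutch index dir.
INDEX_DIR=${INDEX_DIR:-./crawl/index}
if [ -d "$INDEX_DIR" ]; then
    find "$INDEX_DIR" -type f -exec cat {} + > /dev/null
    echo "primed"
fi
```

Obviously this only helps while the files stay resident, and with
8 GB of RAM against a <5 GB index that should be the case, but if
there's a proper mmap or warm-up mechanism I'd rather use that.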
The other question I have is regarding anchor text and link analysis.
The site is just a directory hierarchy, and I crawled it using
'file:///' URLs - do I need to do an http:// crawl for anchor text
to work, or can I just run a partial rebuild on the segments? Does
0.9 have an approximation of PageRank, and if so, does it work on
file: URLs with the same host?
Sorry to bug you guys, but I can't find anything on the wiki that's
really helpful, nor can anyone on the Nutch user list supply an
answer to these two topics.
Cheers,
Winton