Try the following settings in your nutch-site.xml:
<property>
  <name>io.map.index.skip</name>
  <value>7</value>
</property>
<property>
  <name>indexer.termIndexInterval</name>
  <value>1024</value>
</property>
The first causes data files to use considerably less memory: with a skip of 7, only every eighth index entry is held in memory.
The second affects index creation, so it must be set before you create the
index you search. It's okay if your segment indexes were created
without it: just (re-)merge the indexes, and the merged index will
pick up the setting and use less memory when searching.
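For reference, re-merging was done with Nutch's index merger tool; a minimal sketch, assuming the usual crawl directory layout (the paths here are examples, not from this mail, and the exact command form may differ by Nutch version):

```
# Merge the per-segment indexes under crawl/indexes into a single
# index at crawl/index. The merged index is written using whatever
# indexer.termIndexInterval is currently set in nutch-site.xml.
bin/nutch merge crawl/index crawl/indexes
```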
Combining these two, I have searched a 40+M page index on a machine using
about 500MB of RAM. That said, search times with such a large index are
not good. At some point, as your collection grows, you will want to
merge multiple indexes containing different subsets of segments, put
each on a separate box, and search them all with distributed search.
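To sketch what that setup looked like in Nutch of this era (hostnames, ports, and paths below are made up for illustration): each box runs a distributed-search server over its local index, and the web front end is pointed at a search-servers.txt file listing those servers via the searcher.dir property.

```
# conf/search-servers.txt on the web front end: one "host port" per line
searchbox1.example.com 9999
searchbox2.example.com 9999

# on each search box, serve its local crawl data:
bin/nutch server 9999 /path/to/crawl
```

The front end then fans each query out to all listed servers and merges the results, so per-box memory stays bounded by that box's subset of the index.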
Doug
Jay Pound wrote:
I'm testing an index of 30 million pages; it requires 1.5GB of RAM to search
using Tomcat 5. I plan on having an index with multiple billion pages, but
if this is how it scales, then even with 16GB of RAM I won't be able to have an
index larger than 320 million pages. How can I distribute the memory
requirements across multiple machines? Or is there another servlet container
(like Resin) that will require less memory to operate? Has anyone else run
into this?
Thanks,
-Jay Pound