Try the following settings in your nutch-site.xml:

<property>
  <name>io.map.index.skip</name>
  <value>7</value>
</property>

<property>
  <name>indexer.termIndexInterval</name>
  <value>1024</value>
</property>
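(These properties go inside the config file's root element. For reference, a complete minimal nutch-site.xml might look like the sketch below; note that the root element name varies by version, <configuration> on current trunk versus <nutch-conf> in 0.7.x.)

<?xml version="1.0"?>
<configuration>

<property>
  <name>io.map.index.skip</name>
  <value>7</value>
</property>

<property>
  <name>indexer.termIndexInterval</name>
  <value>1024</value>
</property>

</configuration>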

The first causes the in-memory indexes of map data files to use considerably less memory: with a skip of 7, only one of every eight index entries is loaded, so that footprint drops to roughly an eighth of the default.

The second affects index creation, so it must be set before you create the index you search. (It raises Lucene's term index interval from its default of 128, so roughly one eighth as many terms are kept in RAM while searching.) It's okay if your segment indexes were created without it: you can simply (re-)merge the indexes, and the merged index will pick up the setting and use less memory when searching.
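For example, a merge along these lines should do it. This is just a sketch: the paths are made up, and the merge tool's exact arguments may differ between Nutch versions.

bin/nutch merge crawl/index-merged crawl/indexes

Here bin/nutch merge runs the IndexMerger, producing a single merged index from the per-segment indexes; the search front end should then be pointed at the merged index.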

Combining these two, I have searched a 40+M page index on a machine using about 500MB of RAM. That said, search times with such a large index are not good. At some point, as your collection grows, you will want to merge multiple indexes containing different subsets of segments, put each on a separate box, and search them with distributed search; a sketch of that setup follows.
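Concretely (the host names, port, and paths below are hypothetical, and the commands assume the DistributedSearch server that ships with Nutch): run a search server on each box over its local index and segments, then list the servers in a search-servers.txt file in the directory the front end's searcher.dir property points at.

On each search box:

bin/nutch server 9999 /data/crawl-part1

search-servers.txt on the front-end box:

search1.example.com 9999
search2.example.com 9999

The front end then fans each query out to all listed servers and merges their results.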

Doug

Jay Pound wrote:
I'm testing an index of 30 million pages; it requires 1.5GB of RAM to search
using Tomcat 5. I plan on having an index with multiple billion pages, but
if this is how it scales, then even with 16GB of RAM I won't be able to have
an index larger than 320 million pages. How can I distribute the memory
requirements across multiple machines? Or is there another servlet container
(like Resin) that will require less memory to operate? Has anyone else run
into this?
Thanks,
-Jay Pound

