Wouldn't it simply be the number of threads that you use to fetch the pages?
Doug Cutting wrote:
The latest code in SVN requires less RAM. If you still have problems,
try setting the config option io.map.index.skip to 8, and
indexer.termIndexInterval to 1024. These will both cause less RAM to
be used. On a 1GB machine I have built Nutch systems with over 40M
pages using these settings.
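A sketch of how those two overrides might look, assuming they go in conf/nutch-site.xml (the local override file for nutch-default.xml); the property names and values are the ones given above, while the file layout and comments are illustrative:

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides for nutch-default.xml.
     Both settings trade some lookup speed for a smaller RAM footprint. -->
<nutch-conf>
  <property>
    <!-- Load only every 8th entry of each MapFile index into memory. -->
    <name>io.map.index.skip</name>
    <value>8</value>
  </property>
  <property>
    <!-- Keep only every 1024th term in the in-memory term index,
         reducing RAM used while indexing and searching. -->
    <name>indexer.termIndexInterval</name>
    <value>1024</value>
  </property>
</nutch-conf>
```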
Doug
cao yuzhong wrote:
Has anyone used Nutch to index over 90G of HTML pages (about 6 million
pages)?
Is it possible? How much RAM does it require?
I tried to use Nutch to index 90G of HTML pages.
My PC has 1G of RAM and the JVM heap is set to -Xmx1000m.
The following is my problem:
Exception in thread "main" java.lang.OutOfMemoryError
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:194)
        at net.nutch.fs.LocalFileSystem$LocalNFSFileInputStream.read(LocalFileSystem.java:68)
        at net.nutch.fs.NFSDataInputStream$PositionCache.read(NFSDataInputStream.java:24)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
        at java.io.DataInputStream.
        at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
        at java.io.DataInputStream.readFully(DataInputStream.java:176)
        at net.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:42)
        at net.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:76)
        at net.nutch.io.SequenceFile$Reader.next(SequenceFile.java:241)
        at net.nutch.io.MapFile$Reader.seek(MapFile.java:263)
        at net.nutch.io.MapFile$Reader.get(MapFile.java:306)
        at net.nutch.io.ArrayFile$Reader.get(ArrayFile.java:62)
        at net.nutch.segment.SegmentReader.get(SegmentReader.java:284)
        at net.nutch.indexer.IndexSegment.indexPages(IndexSegment.java:110)
        at net.nutch.indexer.IndexSegment.main(IndexSegment.java:241)
Any suggestions?
Best regards!
cyz