Has anyone used Nutch to index over 90 GB of HTML pages (about 6 million pages)?
Is it possible? How much RAM does it require?
I tried to use Nutch to index 90 GB of HTML pages.
My PC has 1 GB of RAM, and the JVM heap parameter is set to -Xmx1000m.
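To confirm how much heap the JVM actually gets with that setting, a small check along these lines could be run. This is only a minimal sketch: HeapCheck and its methods are hypothetical names, not part of Nutch, and it relies only on the standard java.lang.Runtime calls.

```java
// Hypothetical helper, not part of Nutch: reports the heap limits of the running JVM.
final class HeapCheck {
    private static final long BYTES_PER_MIB = 1024L * 1024L;

    // Convert a byte count to whole mebibytes (rounds down).
    static long toMiB(long bytes) {
        return bytes / BYTES_PER_MIB;
    }

    // Maximum heap the JVM will attempt to use, in MiB (governed by -Xmx).
    static long maxHeapMiB() {
        return toMiB(Runtime.getRuntime().maxMemory());
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("max heap (MiB):   " + maxHeapMiB());
        System.out.println("total heap (MiB): " + toMiB(rt.totalMemory()));
        System.out.println("free heap (MiB):  " + toMiB(rt.freeMemory()));
    }
}
```

Running it as `java -Xmx1000m HeapCheck` prints the limit the JVM is working with; the reported maximum may be slightly below the -Xmx value. With only 1 GB of physical RAM, a heap this large probably also forces the machine to swap.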
This is the error I get:
Exception in thread "main" java.lang.OutOfMemoryError
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:194)
at net.nutch.fs.LocalFileSystem$LocalNFSFileInputStream.read(LocalFileSystem.java:68)
at net.nutch.fs.NFSDataInputStream$PositionCache.read(NFSDataInputStream.java:24)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
at java.io.DataInputStream.readFully(DataInputStream.java:176)
at net.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:42)
at net.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:76)
at net.nutch.io.SequenceFile$Reader.next(SequenceFile.java:241)
at net.nutch.io.MapFile$Reader.seek(MapFile.java:263)
at net.nutch.io.MapFile$Reader.get(MapFile.java:306)
at net.nutch.io.ArrayFile$Reader.get(ArrayFile.java:62)
at net.nutch.segment.SegmentReader.get(SegmentReader.java:284)
at net.nutch.indexer.IndexSegment.indexPages(IndexSegment.java:110)
at net.nutch.indexer.IndexSegment.main(IndexSegment.java:241)
Any suggestions?
Best regards!
cyz