Hi, I have successfully run the nutch tutorial at http://lucene.apache.org/nutch/tutorial.html using nutch 0.7.1 and have been able to set up the web server etc. using a crawl of a few hundred thousand sites with no problems at all.

Having now crawled a few million more sites (in one go, i.e. it's all in one segment), I am running into trouble updating the webdb with the contents of that segment. I was initially running on a RedHat 9 32-bit machine with 6 GB RAM. Although none of the file sizes or RAM use exceeded the 32-bit limits there, I then swapped to a SuSE 8.1 (for AMD64) machine with 12 GB RAM, using the 64-bit jdk1.5.0_06. The problem occurs specifically when I run the following command:

~/nutch-0.7.1/bin/nutch updatedb db segments/20051216172239

where ls -l segments/20051216172239/* gives:

segments/20051216172239/content:
total 3918828
-rw-r--r--    1 9   coe      4012718873 Dec 17 18:23 data
-rw-r--r--    1 9   coe        158068 Dec 17 18:23 index

segments/20051216172239/fetcher:
total 148232
-rw-r--r--    1 9   coe      151626004 Dec 17 18:23 data
-rw-r--r--    1 9   coe        158068 Dec 17 18:23 index

segments/20051216172239/fetchlist:
total 106204
-rw-r--r--    1 9   coe      108590211 Dec 16 17:32 data
-rw-r--r--    1 9   coe        157004 Dec 16 17:32 index

segments/20051216172239/parse_data:
total 3873864
-rw-r--r--    1 9   coe      3966675725 Dec 17 18:23 data
-rw-r--r--    1 9   coe        158068 Dec 17 18:23 index

segments/20051216172239/parse_text:
total 1120748
-rw-r--r--    1 9   coe      1147485081 Dec 17 18:23 data
-rw-r--r--    1 9   coe        158068 Dec 17 18:23 index

I get the following errors:

run java in /home/9/jdk1.5.0_06
051222 195653 parsing file:/home/9/nutch-0.7.1/conf/nutch-default.xml
051222 195653 parsing file:/home/lkr109/nutch-0.7.1/conf/nutch-site.xml
051222 195654 No FS indicated, using default:local
051222 195654 Updating db
051222 195705 Updating for segments/20051216172239
051222 195705 Processing document 0
051222 195705 Plugins: looking in: /home/9/nutch-0.7.1/plugins
051222 195705 not including: /home/9/nutch-0.7.1/plugins/query-more
051222 195705 parsing: /home/9/nutch-0.7.1/plugins/query-site/plugin.xml
051222 195705 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
051222 195705 parsing: /home/9/nutch-0.7.1/plugins/parse-html/plugin.xml
051222 195705 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
051222 195705 parsing: /home/9/nutch-0.7.1/plugins/parse-text/plugin.xml
051222 195706 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
051222 195706 not including: /home/9/nutch-0.7.1/plugins/parse-ext
051222 195706 not including: /home/9/nutch-0.7.1/plugins/parse-pdf
051222 195706 not including: /home/9/nutch-0.7.1/plugins/parse-rss
051222 195706 parsing: /home/9/nutch-0.7.1/plugins/query-basic/plugin.xml
051222 195706 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
051222 195706 not including: /home/9/nutch-0.7.1/plugins/index-more
051222 195706 not including: /home/9/nutch-0.7.1/plugins/parse-js
051222 195706 parsing: /home/9/nutch-0.7.1/plugins/urlfilter-regex/plugin.xml
051222 195706 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
051222 195706 not including: /home/9/nutch-0.7.1/plugins/protocol-ftp
051222 195706 not including: /home/9/nutch-0.7.1/plugins/parse-msword
051222 195706 not including: /home/9/nutch-0.7.1/plugins/creativecommons
051222 195706 not including: /home/9/nutch-0.7.1/plugins/ontology
051222 195706 parsing: /home/9/nutch-0.7.1/plugins/nutch-extensionpoints/plugin.xml
051222 195706 not including: /home/9/nutch-0.7.1/plugins/protocol-file
051222 195706 parsing: /home/9/nutch-0.7.1/plugins/protocol-http/plugin.xml
051222 195706 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
051222 195706 not including: /home/9/nutch-0.7.1/plugins/clustering-carrot2
051222 195706 not including: /home/9/nutch-0.7.1/plugins/language-identifier
051222 195706 not including: /home/9/nutch-0.7.1/plugins/urlfilter-prefix
051222 195706 parsing: /home/9/nutch-0.7.1/plugins/query-url/plugin.xml
051222 195706 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
051222 195706 parsing: /home/9/nutch-0.7.1/plugins/index-basic/plugin.xml
051222 195706 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
051222 195706 not including: /home/9/nutch-0.7.1/plugins/protocol-httpclient
051222 195706 found resource regex-urlfilter.txt at file:/home/9/nutch-0.7.1/conf/regex-urlfilter.txt
051222 195706 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
051222 195710 Processing document 1000
051222 195712 Processing document 2000
051222 195714 Processing document 3000
051222 195716 Processing document 4000
...
...
051222 202623 Processing document 777000
051222 202625 Processing document 778000
051222 202625 Finishing update
Exception in thread "main" java.io.IOException: Input/output error
       at java.io.FileInputStream.readBytes(Native Method)
       at java.io.FileInputStream.read(FileInputStream.java:194)
       at org.apache.nutch.fs.LocalFileSystem$LocalNFSFileInputStream.read(LocalFileSystem.java:83)
       at org.apache.nutch.fs.NFSDataInputStream$PositionCache.read(NFSDataInputStream.java:37)
       at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
       at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
       at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
       at java.io.DataInputStream.readFully(DataInputStream.java:176)
       at org.apache.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
       at org.apache.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
       at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:309)
       at org.apache.nutch.io.SequenceFile$Sorter$MergeStream.next(SequenceFile.java:725)
       at org.apache.nutch.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:755)
       at org.apache.nutch.io.SequenceFile$Sorter$MergePass.run(SequenceFile.java:654)
       at org.apache.nutch.io.SequenceFile$Sorter.mergePass(SequenceFile.java:591)
       at org.apache.nutch.io.SequenceFile$Sorter.sort(SequenceFile.java:419)
       at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:535)
       at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
       at org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
       at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
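
The trace bottoms out in FileInputStream.readBytes, a native read, so the "Input/output error" is reported by the operating system rather than raised by Nutch's own logic. One way to narrow this down might be to read the same file from start to end outside Nutch and see whether the error reproduces, and at which offset. The following is only a minimal sketch under that assumption: ReadCheck and readThrough are hypothetical names, not part of Nutch, and the code uses try-with-resources, so it needs Java 7 or later (newer than the jdk1.5.0_06 used above).

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Nutch: reads a file sequentially and
// reports how far it got before any I/O error.
final class ReadCheck {

    /**
     * Reads the whole file in 64 KB chunks and returns the number of bytes read.
     * If a read fails, the IOException is rethrown with the byte offset added.
     */
    static long readThrough(String path) throws IOException {
        byte[] buffer = new byte[64 * 1024];
        long offset = 0; // long, so files larger than 2 GB are counted correctly
        try (InputStream in = new FileInputStream(path)) {
            while (true) {
                int n;
                try {
                    n = in.read(buffer);
                } catch (IOException e) {
                    throw new IOException(
                            "read failed at byte offset " + offset + " of " + path, e);
                }
                if (n < 0) {
                    return offset; // end of file reached without error
                }
                offset += n;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        for (String path : args) {
            System.out.println(path + ": " + readThrough(path) + " bytes read without error");
        }
    }
}
```

After compiling, it could be run as `java ReadCheck <file>...` against the large files involved. A failure at a consistent offset would point towards the disk or filesystem rather than the sort itself.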



After this failure the contents of the db directory are as follows (ls -ltr db/*/*):

-rw-r--r--    1 9   coe            17 Dec 20 20:03 db/webdb/stats

db/webdb/linksByMD5:
total 686364
-rw-r--r--    1 9   coe      697121985 Dec 20 19:59 data
-rw-r--r--    1 9   coe       5713088 Dec 20 19:59 index

db/webdb/linksByURL:
total 686364
-rw-r--r--    1 9   coe      697121985 Dec 20 20:03 data
-rw-r--r--    1 9   coe       5710026 Dec 20 20:03 index

db/webdb/pagesByMD5:
total 355200
-rw-r--r--    1 9   coe       3081749 Dec 20 20:05 index
-rw-r--r--    1 9   coe      360637849 Dec 20 20:05 data

db/webdb/pagesByURL:
total 505324
-rw-r--r--    1 9   coe       1805388 Dec 20 20:08 index
-rw-r--r--    1 9   coe      515641937 Dec 20 20:08 data

db/webdb.new/tmp:
total 5344972
-rw-r--r--    1 9   coe            75 Dec 22 19:57 pagesByMD5.out
-rw-r--r--    1 9   coe            75 Dec 22 19:57 linksByURL.out
-rw-r--r--    1 9   coe      774132072 Dec 22 20:26 linksByMD5.out
-rw-r--r--    1 9   coe      4008686084 Dec 22 20:26 pagesByURL.out
-rw-r--r--    1 9   coe      690415479 Dec 22 20:42 pagesByURL.out.sorted

So it seems that something went wrong while the two sorted streams (pagesByURL.out.sorted.0 and pagesByURL.out.sorted.1) were being merged into pagesByURL.out.sorted. A minute or so before the process died, those two files looked as follows:

-rw-r--r-- 1 9 coe 4008697873 Dec 22 20:33 pagesByURL.out.sorted.0
-rw-r--r-- 1 9 coe 3831078912 Dec 22 20:41 pagesByURL.out.sorted.1
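
For readers unfamiliar with this step: merging sorted streams generally means repeatedly taking the smallest head record among the inputs and appending it to the output, so every input file is read sequentially from start to end. The sketch below illustrates that general technique with in-memory lists and a priority queue. It is not Nutch's SequenceFile.Sorter code; MergeSketch, Run and merge are hypothetical names chosen for illustration, and the URLs are made-up sample keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: a generic merge of already-sorted runs, not Nutch's implementation.
final class MergeSketch {

    /** One sorted input run together with the record currently at its head. */
    private static final class Run implements Comparable<Run> {
        final Iterator<String> source;
        String head;

        Run(Iterator<String> source) {
            this.source = source;
            this.head = source.next(); // callers only pass non-empty runs
        }

        /** Moves to the next record; returns false once the run is exhausted. */
        boolean advance() {
            if (!source.hasNext()) {
                return false;
            }
            head = source.next();
            return true;
        }

        @Override
        public int compareTo(Run other) {
            return head.compareTo(other.head);
        }
    }

    /** Merges runs that are each already sorted into one sorted list. */
    static List<String> merge(List<List<String>> sortedRuns) {
        PriorityQueue<Run> queue = new PriorityQueue<>();
        for (List<String> run : sortedRuns) {
            if (!run.isEmpty()) {
                queue.add(new Run(run.iterator()));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Run smallest = queue.poll();   // run whose head record sorts first
            merged.add(smallest.head);
            if (smallest.advance()) {
                queue.add(smallest);       // re-insert with its new head record
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> run0 = Arrays.asList("http://a.example/", "http://c.example/");
        List<String> run1 = Arrays.asList("http://b.example/", "http://d.example/");
        System.out.println(merge(Arrays.asList(run0, run1)));
    }
}
```

In a file-based merge like the one in the trace, each run would be a stream over a file such as pagesByURL.out.sorted.0, so a read error anywhere in either input would abort the merge part-way through and leave a partial output file.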

To be honest, the above output has not been 100% repeatable: I have got it every time except once. On that occasion the update got past pagesByURL and instead died while processing linksByMD5. I am not particularly au fait with Java, so any explicit help would be much appreciated.

Thanks, Ed


