Hi guys,
I have been getting a NullPointerException for the last two days. I am trying to crawl a very large collection of files (about 40 GB). The crawler fetches and indexes about 2000 files (including folders) with no parsing issues. I know there are more files than that in the directory, but the crawler then fails with the following error:

INFO parser.custom: Custom-parse: Parsing content file:/C:/TeamBinder/AddressBook/9100/(65)E110_ST A0 (1).pdf
07/02/12 22:09:16 INFO fetcher.Fetcher: fetch of file:/C:/TeamBinder/AddressBook/9100/(65)E110_ST A0 (1).pdf failed with: java.lang.NullPointerException
07/02/12 22:09:17 INFO mapred.LocalJobRunner: 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,
07/02/12 22:09:17 FATAL fetcher.Fetcher: java.lang.NullPointerException
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:198)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:189)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:91)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:314)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:232)
07/02/12 22:09:17 FATAL fetcher.Fetcher: fetcher caught:java.lang.NullPointerException

The error also occurs with other file formats, not just PDF files. I understand that this is a known issue, as a similar issue was opened a while ago: http://issues.apache.org/jira/browse/NUTCH-220. At first I thought the error was caused by the parser, but I have been able to fetch, parse and index this file type before, including in this crawl, so I do not think the problem is caused by any parser or protocol plugin.
I am crawling a local drive, so if there were a problem with the protocol, a 404 file protocol error (file not found) should be thrown instead. I am trying to get to the bottom of this because I am trying to build an index, and this error causes the whole process to abort. If someone from the community can help, I am open to any suggestions. It seems that the error is caused by the Hadoop process. If that is the case, can someone point me in the right direction? Also, some plugins, such as the parse-xml plugin, have major issues with multi-threading in Nutch. Has anybody experienced those issues before? I am looking forward to your views on this issue. I am using Nutch 0.8.2 dev from the branch.

Best Regards,
Armel

_________________________
Armel T. Nene
iDNA Solutions LTD
Tel: +44 (20) 7257 6124
Mobile: +44 (7886)950 483
Web: http://www.idna-solutions.com
Blog: http://blog.idna-solutions.com