Greetings!

This may be a Nutch question and if so, I will repost to the Nutch list.

I can run the following commands with Solr-3.5.0/Nutch-1.4:

bin/nutch crawl urls -dir crawl -depth 3 -topN 5


then:

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*


successfully.

But if I run:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

it fails with the following messages:

SolrIndexer: starting at 2011-12-11 14:01:27
Adding 11 documents
SolrIndexer: finished at 2011-12-11 14:01:28, elapsed: 00:00:01
SolrDeleteDuplicates: starting at 2011-12-11 14:01:28
SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

I am running on Ubuntu 10.10 with 12 GB of memory, Java version 1.6.0_26.

I can delete the crawl directory and replicate this error consistently.

Suggestions?

Other than "...use the way that doesn't fail." ;-)

I am concerned that one invocation path failing consistently points to something that could cause trouble elsewhere when least expected (and be hard to isolate as the cause).
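In case it helps anyone reproduce this quickly: I believe the step that is blowing up can be run on its own, without re-crawling, since my bin/nutch script appears to map a solrdedup command to SolrDeleteDuplicates (that mapping is my reading of the script, so please correct me if I'm wrong):

```shell
# Run only the dedup step against an existing index,
# skipping inject/fetch/index entirely.
# Assumes a Nutch 1.4 install and Solr running on localhost:8983.
bin/nutch solrdedup http://localhost:8983/solr/
```

If that alone reproduces the IOException, the crawl command itself can probably be ruled out.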

Thanks!

Hope everyone is having a great weekend!

Patrick

PS: From the Hadoop log (when it fails), in case that's helpful:

2011-12-11 15:21:51,436 INFO  solr.SolrWriter - Adding 11 documents
2011-12-11 15:21:52,250 INFO  solr.SolrIndexer - SolrIndexer: finished at 2011-12-11 15:21:52, elapsed: 00:00:01
2011-12-11 15:21:52,251 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting at 2011-12-11 15:21:52
2011-12-11 15:21:52,251 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
2011-12-11 15:21:52,330 WARN  mapred.LocalJobRunner - job_local_0020
java.lang.NullPointerException
    at org.apache.hadoop.io.Text.encode(Text.java:388)
    at org.apache.hadoop.io.Text.set(Text.java:178)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
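One guess from the trace, for what it's worth: the NPE is in Text.set() inside SolrDeleteDuplicates' record reader, which (if I am reading SolrDeleteDuplicates.java correctly) reads each document's digest field back from Solr, so a document with no digest value might explain a null reaching Text.set(). The field name digest is my assumption from the Nutch Solr schema. A query like this should show whether any such documents exist:

```shell
# Count indexed documents that lack a "digest" field
# (numFound in the response; rows=0 suppresses the documents themselves).
# Assumes Solr 3.5 running on localhost:8983 with the Nutch schema.
curl 'http://localhost:8983/solr/select?q=-digest:[*%20TO%20*]&rows=0'
```

If numFound is nonzero after the crawl-with--solr run but zero after the two-step run, that might explain why only one path fails.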


--
Patrick Durusau
patr...@durusau.net
Chair, V1 - US TAG to JTC 1/SC 34
Convener, JTC 1/SC 34/WG 3 (Topic Maps)
Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)
OASIS Technical Advisory Board (TAB) - member

Another Word For It (blog): http://tm.durusau.net
Homepage: http://www.durusau.net
Twitter: patrickDurusau
