What are you using for your crawl command line?  I remember trying to get mine
to work, and there was a line in the tutorial that wasn't very clear:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

where you had to include the -solr location for it to index the files.  If
they are each working separately, then I would guess the problem is somewhere
in the connection between them; that was where my problem was.
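For comparison, on Nutch 1.3 the crawl and the Solr push can also be run as two separate steps, which makes it easier to see where the failure happens.  A sketch, assuming the default layout from the 1.3 tutorial (the 'crawl' output directory, depth/topN values, and the Solr URL are placeholders to adapt):

```shell
# 1. Crawl without indexing; output lands in crawl/{crawldb,linkdb,segments}
bin/nutch crawl urls -dir crawl -depth 3 -topN 5

# 2. Push the crawled segments to Solr as a separate step
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
```

If step 2 fails with the same error, the problem is in the Nutch-to-Solr handoff rather than in the crawl itself.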

Jerry E. Craig, Jr.


-----Original Message-----
From: John R. Brinkema [mailto:[email protected]] 
Sent: Monday, August 01, 2011 11:46 AM
To: [email protected]
Subject: Nutch-1.3 + Solr 3.3.0 = fail

Friends,

I am having the worst time getting nutch and solr to play together nicely.

I downloaded and installed the current binaries for both nutch and solr.  I 
edited the nutch-site.xml file to include:

<property>
<name>http.agent.name</name>
<value>Solr/Nutch Search</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html|tika)|
index-basic|query-(basic|stemmer|site|url)|summary-basic|scoring-opic|
urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>http.content.limit</name>
<value>65536</value>
</property>
<property>
<name>searcher.dir</name>
<value>/opt/SolrSearch</value>
</property>


I installed them and tested them according to their respective tutorials; in
other words, I believe each is working separately.  I crawled a URL, and the
'readdb -stats' report shows that I have successfully collected some links.
Most of the links are to '.pdf' files.

I followed the instructions to link Nutch and Solr, e.g. copying the Nutch
schema to become the Solr schema.
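When solrindex runs, the Solr document's 'id' field is filled from the crawled URL according to Nutch's conf/solrindex-mapping.xml.  A sketch of the relevant fragment as it ships with Nutch 1.3 (only the uniqueKey-related lines shown; treat the exact contents as an assumption and check your own copy):

```xml
<mapping>
  <fields>
    <!-- ... other field mappings ... -->
    <field dest="id" source="url"/>
    <field dest="url" source="url"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
```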

When I run the bin/nutch solrindex ... command I get the following error:

java.io.IOException: Job failed!

When I look in the log/hadoop.log file I see:

2011-08-01 13:10:00,086 INFO  solr.SolrMappingReader - source: content dest: content
2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: site dest: site
2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: title dest: title
2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: host dest: host
2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: segment dest: segment
2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: boost dest: boost
2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: digest dest: digest
2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: url dest: id
2011-08-01 13:10:00,087 INFO  solr.SolrMappingReader - source: url dest: url
2011-08-01 13:10:00,537 WARN  mapred.LocalJobRunner - job_local_0001
org.apache.solr.common.SolrException: Document [null] missing required field: id

Document [null] missing required field: id

request: http://localhost:8983/solr/update?wt=javabin&version=2
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
        at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:82)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2011-08-01 13:10:01,050 ERROR solr.SolrIndexer - java.io.IOException: Job failed!

The same error appears in the solr log.
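The error itself says the document reaching Solr has no value for the schema's required 'id' field.  In the Solr 3.x example schema, 'id' is both required and the uniqueKey; a sketch of the relevant lines (exact type and attributes are from memory, so check your schema.xml):

```xml
<!-- schema.xml: the field the error complains about -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<!-- ... -->
<uniqueKey>id</uniqueKey>
```

One way this error can occur is if the copied Nutch schema renamed or dropped the 'id' field while the index mapping still tries to write one; a client/server version mismatch that leaves documents empty could be another, so this fragment is only a place to start checking.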

I have tried the 'sync solrj libraries' fix; that is, I copied
apache-solr-solrj-3.3.0.jar from the Solr lib to the Nutch lib, with no effect.
Since I am running binaries, I did not, of course, run 'ant job'.  Is that the
magic?

Any suggestions?
