SOLR Indexing issue, possibly due to NUTCH-1084?

Mike Pountney Tue, 07 Aug 2012 04:51:15 -0700

Hi there,

I have an issue with our Nutch 1.4 deployment, whereby a page that has been 
successfully crawled (readdb -dump gives db_fetched status) is not being 
indexed into SOLR.


Trying to retrieve the content using:

nutch readdb $crawldb -url $url

gives:

java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus 
because org.apache.nutch.protocol.ProtocolStatus

... which appears to be a known bug as per NUTCH-1084.

Could this be the reason why the content is not being indexed? Does 'nutch 
solrindex' iterate through the pages using the same codebase that is failing?

Is there any workaround for the NUTCH-1084 issue? It's occurring on about 10% of 
the pages we've crawled; the rest are fine (and appear to be indexed).

Incidentally, we're running this under the Hadoop 0.20 task/jobtracker, on a 
single node with no HDFS usage.

Any help is greatly appreciated.

Mike
