Hi Lewis,
It didn't work for me.
Here is what I did:
1. I set up a test web site on my local machine.
2. I crawled the site, removed one page, and crawled again.
3. Checked that the page I removed was still indexed by Solr and was flagged
as gone (status = 3) in the database (HBase):
hbase> scan 'webpage', {COLUMNS => ['f:st']}
localhost:http:3000/tests  column=f:st, timestamp=1375203394614, value=\x00\x00\x00\x03
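(For reference, the f:st value is just a 4-byte big-endian integer; a quick, self-contained sketch of the decoding, plain Java, nothing Nutch-specific:)

```java
import java.nio.ByteBuffer;

public class DecodeStatus {
    public static void main(String[] args) {
        // The bytes as shown by the hbase scan: \x00\x00\x00\x03
        byte[] raw = {0x00, 0x00, 0x00, 0x03};
        // HBase stores the int big-endian, which is ByteBuffer's default order
        int status = ByteBuffer.wrap(raw).getInt();
        System.out.println(status); // 3 => the "gone" status
    }
}
```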
4. I applied the patch
cd /usr/local/nutch-2.2.1
patch -p0 < NUTCH-1294-v2.patch
The patch didn't update "src/bin/nutch" and "conf/log4j.properties" for some
reason, so I updated those manually.
5. Ran the "solrclean" task in distributed mode:
$NUTCH_DEPLOY/bin/nutch solrclean http://localhost:8983/solr
*Expected result:* The "gone" document is removed from Solr.
*Actual result:* The document is still in Solr.
*Additional information:* I enabled logging for SolrClean and its
dependencies:
log4j.logger.org.apache.nutch.indexer.solr.SolrClean=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.IndexCleanerJob=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.IndexCleaningFilters=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.IndexCleaningFilter=INFO,cmdstdout
Then I added a LOG.info("method-name"); line to each method in these 4
classes. This is how I found out that the map method of the IndexCleanerJob
class was never called, so no documents were processed.
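A minimal, self-contained sketch of the entry-logging technique I used (the method name here is a hypothetical stand-in; the real IndexCleanerJob signature differs): log on entry, so an absent log line proves the method never ran.

```java
import java.util.ArrayList;
import java.util.List;

public class EntryLogSketch {
    // Stand-in for the log4j output stream
    static final List<String> LOG = new ArrayList<>();

    // Hypothetical stand-in for IndexCleanerJob.map
    static void map(String key) {
        LOG.add("map: " + key); // first line of the method: record that we got here
        // ... cleaning logic would go here ...
    }

    public static void main(String[] args) {
        // map is never invoked, so the log stays empty -- the symptom I saw
        System.out.println(LOG.isEmpty() ? "map not called" : LOG.get(0));
    }
}
```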
I will try to find out why.
I'm running:
- Hadoop 1.1.2 (single machine)
- Nutch 2.2.1 with patch NUTCH-1294-v2
- HBase 0.90.4
- Java 1.7.0_21
Thanks,
Claudiu.
--
Sent from the Nutch - User mailing list archive at Nabble.com.