Hi Lewis,
It didn't work for me.
Here is what I did:
1. I set up a test web site on my local machine.
2. I crawled the site, removed one page, and crawled again.
3. Checked that the page I removed was still indexed by Solr and was flagged
as gone (status = 3) in the database (HBase):
hbase> scan 'webpage', {COLUMNS => ['f:st']}
localhost:http:3000/tests  column=f:st, timestamp=1375203394614, value=\x00\x00\x00\x03
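(For reference, the f:st value is just a 4-byte big-endian integer; a quick, self-contained sketch of the decoding, plain Java, nothing Nutch-specific:)

```java
import java.nio.ByteBuffer;

public class DecodeStatus {
    public static void main(String[] args) {
        // The bytes as shown by the hbase scan: \x00\x00\x00\x03
        byte[] raw = {0x00, 0x00, 0x00, 0x03};
        // HBase stores the int big-endian, which is ByteBuffer's default order
        int status = ByteBuffer.wrap(raw).getInt();
        System.out.println(status); // 3 => the "gone" status
    }
}
```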
4. I applied the patch
cd /usr/local/nutch-2.2.1
patch -p0 < NUTCH-1294-v2.patch
The patch didn't update "src/bin/nutch" and "conf/log4j.properties" for some
reason, so I updated those manually.
5. Ran the "solrclean" task in distributed mode:
$NUTCH_DEPLOY/bin/nutch solrclean http://localhost:8983/solr
*Expected result:* The "gone" document is removed from Solr.
*Actual result:* The document is still in Solr.
*Additional information:* I enabled logging for SolrClean and its
dependencies:
log4j.logger.org.apache.nutch.indexer.solr.SolrClean=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.IndexCleanerJob=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.IndexCleaningFilters=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.IndexCleaningFilter=INFO,cmdstdout
Then I added a LOG.info("method-name"); line to each method in these 4
classes. This is how I found out that the map method of the IndexCleanerJob
class was never called, so no documents were processed.
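A minimal, self-contained sketch of the entry-logging technique I used (the method name here is a hypothetical stand-in; the real IndexCleanerJob signature differs): log on entry, so an absent log line proves the method never ran.

```java
import java.util.ArrayList;
import java.util.List;

public class EntryLogSketch {
    // Stand-in for the log4j output stream
    static final List<String> LOG = new ArrayList<>();

    // Hypothetical stand-in for IndexCleanerJob.map
    static void map(String key) {
        LOG.add("map: " + key); // first line of the method: record that we got here
        // ... cleaning logic would go here ...
    }

    public static void main(String[] args) {
        // map is never invoked, so the log stays empty -- the symptom I saw
        System.out.println(LOG.isEmpty() ? "map not called" : LOG.get(0));
    }
}
```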
I will try to find out why.
I'm running:
- Hadoop 1.1.2 (single machine)
- Nutch 2.2.1 with patch NUTCH-1294-v2
- HBase 0.90.4
- Java 1.7.0_21
Thanks,
Claudiu.
--
Sent from the Nutch - User mailing list archive at Nabble.com.