Shouldn't that be subcollection:wiki instead? Also, I assume you have added
subcollection to plugin.includes in the config file (nutch-site.xml).
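In case it helps, here is a rough sketch of what that entry might look like.
The plugin id "subcollection" and the other plugin names in the value are
assumptions on my part; keep whatever your existing plugin.includes already
lists and just append the new id.

```xml
<!-- nutch-site.xml: sketch only, not a drop-in replacement -->
<property>
  <name>plugin.includes</name>
  <!-- Assumption: the patched plugin registers under the id "subcollection".
       The entries before it are illustrative; use your current value. -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|subcollection</value>
</property>
```

Since the field is written at index time, you would presumably need to
re-index after changing this, and then try a query such as
subcollection:wiki loudon.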
Andrew Libby wrote:
I've applied the patch in the ticket linked to below. I browsed the patch
to try to figure out how to use this plugin, and I'm having trouble getting
it working.
Before I get into the details, if someone has a source of information
describing how nutch starts up and initializes plugins so that I can get a
feel for if this patch is even being used properly in the system, I'd very
much appreciate it.
----
Here's what I did:
Applied the patch with patch -p0 < subcollection.2.path
Compiled tarball with ant tar
Extracted tarball in my runtime location with tar -zxvpf - nutch-0.8-dev.tar.gz
Created urls/urls.txt containing my site name (http://www.philadelphiariders.com/)
Edited crawl-urlfilter.xml to accept aforementioned site name
Edited subcollections.xml and added the following:
<subcollection>
<name>wiki</name>
<id>wiki</id>
<whitelist>http://www.philadelphiariders.com/wiki</whitelist>
<blacklist />
</subcollection>
<subcollection>
<name>moto-web</name>
<id>moto-web</id>
<whitelist>http://www.philadelphiariders.com/c/dmoz</whitelist>
<blacklist />
</subcollection>
<subcollection>
<name>gallery</name>
<id>gallery</id>
<whitelist>http://www.philadelphiariders.com/gallery</whitelist>
<blacklist />
</subcollection>
Crawled/indexed my site with ./bin/nutch crawl urls -dir ../nutch-index
When I start tomcat and do some test searching, I get links from the wiki
area without a collection field added to the query. But if I run a query
like:
collection:wiki loudon
which should return documents, I get none. Additionally, if I simply query
collection:wiki, I get no hits.
If anyone has any ideas, I'll be very grateful.
Zaheed Haque wrote:
Maybe this could help you..
http://issues.apache.org/jira/browse/NUTCH-201
Cheers