[ http://issues.apache.org/jira/browse/NUTCH-201?page=all ]
Sami Siren updated NUTCH-201:
-----------------------------
Attachment: subcollections-1.patch
> add support for subcollections
> ------------------------------
>
> Key: NUTCH-201
> URL: http://issues.apache.org/jira/browse/NUTCH-201
> Project: Nutch
> Type: New Feature
> Versions: 0.8-dev
> Reporter: Sami Siren
> Assignee: Sami Siren
> Priority: Minor
> Fix For: 0.8-dev
> Attachments: subcollections-1.patch
>
> A subcollection is a subset of an index. Subcollections are defined
> by URL patterns in the form of a whitelist and a blacklist: to be part
> of a subcollection, a page must match the whitelist and must not match
> the blacklist.
> Subcollection definitions are read from the file subcollections.xml.
> The format is as follows (imagine that you are crawling all the
> virtual hosts from apache.org and you want to tag pages with the
> URL pattern "http://lucene.apache.org/" as part of the subcollection
> lucene):
> <?xml version="1.0" encoding="UTF-8"?>
> <subcollections>
>   <subcollection>
>     <name>lucene</name>
>     <id>lucene</id>
>     <whitelist>http://lucene.apache.org/</whitelist>
>     <blacklist />
>   </subcollection>
> </subcollections>
> The plugin contains an indexing filter, a query filter, and supporting classes.
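
The whitelist/blacklist rule described above can be sketched roughly as follows. This is a minimal illustration, not code from subcollections-1.patch: the class name SubcollectionSketch and its methods are hypothetical, and it assumes that a pattern "matches" when it occurs as a substring of the URL, which the description does not specify.

```java
import java.util.List;

// Hypothetical sketch; names are illustrative and not taken from the patch.
final class SubcollectionSketch {
    private final String name;
    private final List<String> whitelist;
    private final List<String> blacklist;

    SubcollectionSketch(String name, List<String> whitelist, List<String> blacklist) {
        this.name = name;
        this.whitelist = whitelist;
        this.blacklist = blacklist;
    }

    String getName() {
        return name;
    }

    // Assumption: a pattern matches when it appears anywhere in the URL.
    // Empty patterns are ignored so that an empty <blacklist /> excludes nothing.
    private static boolean matchesAny(String url, List<String> patterns) {
        for (String pattern : patterns) {
            if (!pattern.isEmpty() && url.contains(pattern)) {
                return true;
            }
        }
        return false;
    }

    // A URL belongs to the subcollection if it matches the whitelist
    // and does not match the blacklist.
    boolean contains(String url) {
        return matchesAny(url, whitelist) && !matchesAny(url, blacklist);
    }

    public static void main(String[] args) {
        SubcollectionSketch lucene = new SubcollectionSketch(
                "lucene", List.of("http://lucene.apache.org/"), List.of());
        System.out.println(lucene.contains("http://lucene.apache.org/java/docs/")); // true
        System.out.println(lucene.contains("http://httpd.apache.org/docs/"));       // false
    }
}
```

With the example definition from subcollections.xml, a page under http://lucene.apache.org/ would be tagged as part of lucene, while pages from other apache.org virtual hosts would not.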
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers