Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by JamesVictor:
http://wiki.apache.org/nutch/GettingNutchRunningWithWindows

The comment on the change is:
added example for plugin.includes

------------------------------------------------------------------------------
  You'll need to delete or move the crawl directory before starting the crawl off again unless you specify another path on the command above.

+ === Analyzing Additional Resource Types ===
+
+ From the ["Features"]:
+
+ Edit `conf/nutch-site.xml` and change the value of `plugin.includes` to include the plugins for the document types that you want Nutch to handle.
+
+ For example, to add parsing for PDF, MS Office, and OpenOffice documents, and use the `index-more` instead of `index-basic`, you'll have something like:
+
+ {{{
+ <property>
+ <name>plugin.includes</name>
+ <value>protocol-http|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|
+ index-more|query-(basic|site|url)|summary-basic|scoring-opic|
+ urlnormalizer-(pass|regex|basic)</value>
+ </property>
+ }}}
+
  == Web Interface for Search ==
  In your Environment Variables settings, add `NUTCH_JAVA_HOME` and the location of your JVM (e.g. `C:\j2sdk1.4.2_09`) as a new Environment Variable.

_______________________________________________
Nutch-cvs mailing list
Nutch-cvs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-cvs