You set that up in your nutch-site.xml file. Open the nutch-default.xml file (located in <NUTCH_INSTALL_DIR>/conf) and look for this element:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

Notice the "parse" plugins, which are selected by the regex "parse-(text|html|pdf|msword|rss)". This is where you add or remove the available parsers. So, if you only want PDFs, use only the PDF parser: "parse-(pdf)" or just "parse-pdf".

Don't edit the nutch-default.xml file itself. Create a new nutch-site.xml file for your customizations. Basically, copy the nutch-default.xml file, remove everything you don't need to override, and there you go. I believe that is the correct way.

On 6/6/07, Martin Kammerlander <[EMAIL PROTECTED]> wrote:
>
> hi!
>
> I have a question. If I have for example the seed urls and do a crawl based on
> those seeds. If I want to index then only pages that contain for example pdf
> documents, how can I do that?
>
> cheers
> martin
>

--
"Conscious decisions by conscious minds are what make reality real"
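
P.S. For reference, here is a minimal sketch of what such a nutch-site.xml might look like for the PDF-only case. It assumes the same plugin list as the nutch-default.xml entry quoted above, with only the parse entry changed. The <configuration> root element is my assumption about the file layout, so check it against your own nutch-default.xml before relying on it.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: holds only the properties that override nutch-default.xml -->
<!-- Assumption: <configuration> is the same root element your nutch-default.xml uses -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Same value as the default shown above, except parse-(text|html|pdf|msword|rss) is narrowed to parse-pdf -->
    <value>protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-pdf|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

One thing worth testing on a small crawl first: with parse-html removed, HTML pages are presumably no longer parsed, so links to PDFs that appear on those pages may not get followed from your seeds.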
_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general